TfL Has Been Accidentally Allowing Anyone With A Computer To Track Your Boris Bike Journeys

A hole in Transport for London’s datasets shows why pseudonymised data might not always be anonymous.

1. London’s transport authority has come under fire for releasing data that allowed anyone with a computer to track the movements of people using its bike-share scheme.

Oli Scarff / Getty Images

Information about individuals who use London’s bike-share scheme was publicly available for more than six months, Quartz reports, prompting concerns that there needs to be greater oversight of the government’s release of open data.

The availability of this data meant that anyone who downloaded it was able to track the journeys of individual users.

Although this information did not identify individuals by their name, a London-based software engineer, James Siddle, suggested that “with a little effort, it’s possible to find the actual people who made the journeys.”

Transport for London said it took down the potentially identifiable information following Siddle’s blog post last week.

2. A researcher at the University of Nottingham told BuzzFeed that this incident should never have occurred, particularly with such personal information.

Dr Gilad Rosner, who specialises in digital identity and privacy, said that TfL should be applauded for encouraging open data but had been careless in its approach, particularly with something as sensitive as movement data.

3. Rosner said:


Not only are we talking about TfL who has access to a tremendous amount of users but this is movement data, some of the most sensitive around.

With movement data, we can figure out where are you at night, where you drink, did you go to an STI clinic, all this really sensitive information. So the danger of getting a data release like this wrong in terms of its privacy characteristic is very high.


5. This comes after James Siddle downloaded the data, freely available from TfL’s website, and created visualisations to highlight the potential privacy implications.

Siddle, 38, also suggested in a blog post that “with a little effort, it’s possible to find the actual people who have made the journeys”.

Information from TfL’s datasets includes the start and end location of each journey as well as journey times. Crucially, until Siddle’s blog post, this information also included a unique customer ID, which meant that anyone who downloaded the datasets could link separate journeys to the same customer and potentially predict that customer’s movements.

In practice, anyone who downloaded the data could have isolated the journeys made by an individual commuter.

Siddle says he informed TfL about the privacy issue, and a TfL spokesperson told BuzzFeed that it took down the entire dataset following his blog post.

It re-published the data late last week after removing any sensitive information, but emphasised that it would have been very difficult to track down individuals.

6. But Siddle’s blog post disputes this. “All that’s needed to work out who this profile belongs to is one bit of connecting information,” he wrote.

James Siddle / Via jamessiddle.net

In the visualisation above, purple lines reflect return journeys and orange lines represent single journeys. The thicker the line, the more often the journey was made.

There are a number of conclusions that can be immediately deduced:

First, a lot of journeys are made to Crinan Street, right beside King’s Place, which suggests the commuter could work in the King’s Cross area.

Next, a number of journeys are made to and from both Limehouse and Bow, suggesting that while the commuter lives in Limehouse, they might spend a few evenings in Bow. Filtering the data by time of day reinforces this: journeys made between 6.45 a.m. and 9 a.m. mostly start in Limehouse, but occasionally in Bow.
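To illustrate the kind of filtering described above, here is a minimal Python sketch. It is not Siddle’s code: the rows, the customer IDs and the function name morning_start_stations are all invented for illustration, and the only assumption taken from the article is that each journey record carried a customer ID, a start and end location and a start time.

```python
from collections import Counter
from datetime import datetime, time

# Made-up sample rows, NOT real TfL data:
# (customer_id, start_station, end_station, start_time)
JOURNEYS = [
    (1001, "Limehouse", "Crinan Street", "2012-09-03 07:50"),
    (1001, "Limehouse", "Crinan Street", "2012-09-04 07:55"),
    (1001, "Bow", "Crinan Street", "2012-09-05 08:10"),
    (1001, "Crinan Street", "Limehouse", "2012-09-05 15:20"),
    (1001, "Limehouse", "Crinan Street", "2012-09-06 07:48"),
    (2002, "Bow", "Limehouse", "2012-09-03 08:30"),
]


def morning_start_stations(journeys, customer_id,
                           earliest=time(6, 45), latest=time(9, 0)):
    """Count where one customer's journeys begin during the morning window."""
    counts = Counter()
    for cid, start_station, _end_station, started in journeys:
        if cid != customer_id:
            continue  # the shared ID is what ties separate journeys together
        start_clock = datetime.strptime(started, "%Y-%m-%d %H:%M").time()
        if earliest <= start_clock <= latest:
            counts[start_station] += 1
    return counts


counts = morning_start_stations(JOURNEYS, 1001)
likely_home, _ = counts.most_common(1)[0]
print(counts, likely_home)
```

Without the customer ID column, the same rows could not be grouped by person, which is why removing it matters.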

7. Although a TfL spokesperson told BuzzFeed it would be very difficult to identify specific individuals, Siddle claims that pseudonymised data can become very personal when combined with other datasets.

James Siddle / Via jamessiddle.net

The visualisation above, for example, shows that the same commuter generally left Crinan Street no later than 3.30 p.m.

Siddle said that people need to be aware of the risks of pseudonymised data, in which names are replaced by identity numbers. He noted that although it is legal to make this information available because it does not personally identify anyone, combining multiple datasets could provide a rich picture of someone’s life.

Metadata from social network updates, for example, could give away information such as location and time. The commuter could be easily identified if they sent a tweet or a public Facebook status update complaining about the lack of available bikes at his or her local cycle point.
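A rough Python sketch of how such a link could be made is below. Everything in it is hypothetical: the trips, the IDs, the 15-minute window and the function name candidate_ids are assumptions for illustration, not a description of any real attack on TfL’s data.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M"

# Made-up sample rows, NOT real TfL data: (customer_id, start_station, start_time)
TRIPS = [
    (1001, "Limehouse", "2012-09-03 07:50"),
    (2002, "Bow", "2012-09-03 08:30"),
    (3003, "Limehouse", "2012-09-03 18:05"),
    (1001, "Limehouse", "2012-09-04 07:55"),
]


def candidate_ids(trips, station, posted_at, window_minutes=15):
    """Return customer IDs with a trip from `station` close to a public post's time."""
    posted = datetime.strptime(posted_at, FMT)
    window = timedelta(minutes=window_minutes)
    return {
        cid
        for cid, start_station, started in trips
        if start_station == station
        and abs(datetime.strptime(started, FMT) - posted) <= window
    }


# A hypothetical public post sent from near the Limehouse docking point at 07:45.
matches = candidate_ids(TRIPS, "Limehouse", "2012-09-03 07:45")
print(matches)
```

If only one ID matches, every other journey recorded under that ID can then be attributed to the person who wrote the post.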

8. He said:


It’s anonymous but it’s in a gray area. It’s data that you could argue that it’s okay. We’re generally putting ourselves at risk. When you start to aggregate datasets together, you start to build an intimate picture of someone’s life, which many people probably don’t know is possible.

9. TfL told BuzzFeed that the information was erroneously made available during the move to its new website last year.

TfL’s General Manager of Cycle Hire, Nick Aldworth, said: “We’re committed to improving transparency across all our services and publish a range of data for customers and stakeholders online.

“Due to an administrative error, anonymised user identification numbers were shown against individual trips made between 22 July 2012 and 2 February 2013.

“The data, which did not identify any individual customers online, was removed as soon as the matter was brought to our attention.”


11. But Rosner said that there needs to be greater oversight as TfL, and the government in general, collects more data. He told BuzzFeed:


“I’m not going to say that TfL is cavalier but it shows the need to have greater privacy oversight because what’s happening is that the availability of data is increasing.”

“The real question is the next time they release the dataset, who’s checking to make sure that error doesn’t happen again and who’s checking that the methodology for releasing the data is sound from a privacy perspective,” he added.

Siraj Datoo is a political reporter for BuzzFeed News and is based in London.