Open data and privacy risks

Anonymisation is hard to achieve in the face of correlation attacks. Even in a dataset containing millions of records, someone with access to just four random pieces of information about a person can de-anonymise over 90% of those records (Singer 2015).

To illustrate the dangers that come with open data: the New York City Taxi and Limousine Commission released a dataset containing details of every yellow cab ride in New York in 2013, including pickup and drop-off times, locations, fare and tip amounts, and anonymised (hashed) versions of each taxi’s licence and medallion numbers. From this, Tockar (2014) was able to identify the home addresses of frequent visitors to a strip club in the city.

Perfect anonymisation is a myth. There is a tension between the usefulness of the data and the risk of privacy being compromised: the less granular the data, the less interesting and useful it is for businesses, policymakers, researchers and the public. Conversely, the more granular and detailed the information, the greater the risk that personally identifiable and potentially highly sensitive information will be revealed.
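The trade-off between granularity and risk can be made concrete with a small sketch. The records, the `unique_fraction` helper and the banding rule below are all invented for illustration; the sketch simply measures what share of records is unique at two levels of detail.

```python
from collections import Counter

# Invented toy records: (age, postcode). Not real data.
RECORDS = [
    (34, "AB10 1AA"),
    (36, "AB10 1AB"),
    (35, "AB10 2CD"),
    (52, "AB11 5EF"),
    (58, "AB11 5EG"),
    (51, "AB11 6GH"),
]


def unique_fraction(records, coarsen):
    """Share of records whose coarsened values match no other record."""
    keys = [coarsen(record) for record in records]
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1) / len(keys)


def exact(record):
    """Keep full detail: exact age and full postcode."""
    return record


def banded(record):
    """Reduce detail: ten-year age band and the outward part of the postcode."""
    age, postcode = record
    return (age // 10 * 10, postcode.split()[0])


fine = unique_fraction(RECORDS, exact)     # every record is unique
coarse = unique_fraction(RECORDS, banded)  # no record is unique
print(fine, coarse)
```

At full detail every toy record is unique and so potentially identifiable. After banding, each record shares its values with two others, but the exact ages and postcodes that an analyst might have wanted are gone.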

Risks include:

– Re-identification

– False re-identification (when data is partially anonymous, individuals are at risk of having sensitive facts incorrectly connected to them through flawed re-identification techniques)

– Jigsaw identification (the ability to identify someone by using two or more different pieces of information from two or more sources, especially when the person’s identity is meant to be secret for legal reasons)

– The “mosaic effect” / mosaic theory
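Jigsaw identification can be sketched as a simple join between two sources on the attributes they share. Everything in the example below is hypothetical: the two datasets, the field names and the `link` helper are invented, and real linkage attacks are usually fuzzier than an exact match.

```python
# Source A: a "de-identified" release with no names (invented data).
RELEASED = [
    {"birth_year": 1980, "postcode_area": "AB10", "sex": "F", "diagnosis": "asthma"},
    {"birth_year": 1975, "postcode_area": "AB11", "sex": "M", "diagnosis": "diabetes"},
    {"birth_year": 1980, "postcode_area": "AB10", "sex": "M", "diagnosis": "migraine"},
]

# Source B: a separate public list that does contain names (invented data).
REGISTER = [
    {"name": "Person A", "birth_year": 1980, "postcode_area": "AB10", "sex": "F"},
    {"name": "Person B", "birth_year": 1975, "postcode_area": "AB11", "sex": "M"},
    {"name": "Person C", "birth_year": 1975, "postcode_area": "AB11", "sex": "M"},
]

# Attributes present in both sources; none identifies anyone on its own.
SHARED_KEYS = ("birth_year", "postcode_area", "sex")


def link(released, register, keys=SHARED_KEYS):
    """Pair released rows with a name when exactly one register entry matches."""
    index = {}
    for person in register:
        index.setdefault(tuple(person[k] for k in keys), []).append(person["name"])
    matches = []
    for row in released:
        names = index.get(tuple(row[k] for k in keys), [])
        if len(names) == 1:  # an unambiguous match re-identifies the row
            matches.append((names[0], row["diagnosis"]))
    return matches


matches = link(RELEASED, REGISTER)
print(matches)
```

Only the first released row links to a single name. The second row matches two people, so attaching the diagnosis to either of them would be a guess, which is how false re-identification arises.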

Researchers can use various risk mitigation techniques, for example suppressing low counts or aggregating data sets.
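A minimal sketch of these two techniques, assuming a hypothetical minimum cell size of 5 and ten-year age bands (both values are illustrative choices, not a recommended standard), might look like this:

```python
from collections import Counter

THRESHOLD = 5  # hypothetical minimum cell size; real policies vary


def age_band(age, width=10):
    """Map an exact age to a band label such as '30-39'."""
    low = age // width * width
    return f"{low}-{low + width - 1}"


def aggregate(records):
    """Count records per (area, age band) instead of releasing each row."""
    return Counter((area, age_band(age)) for area, age in records)


def suppress_small(counts, threshold=THRESHOLD):
    """Replace any count below the threshold with None (withheld)."""
    return {cell: (n if n >= threshold else None) for cell, n in counts.items()}


# Invented records: (area, age).
records = [("North", age) for age in (30, 31, 32, 33, 34, 35, 36)]
records += [("South", 42), ("South", 47)]

published = suppress_small(aggregate(records))
print(published)
```

The seven North records are published as a single count for the 30-39 band, while the South cell, which covers only two people, is withheld.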

It isn’t simply a question of whether the released information contains anything that could, in and of itself, identify a particular individual. Data protection legislation requires that you also take into account whether that information could be combined with something else so that together they identify the person. Article 4(1) of the GDPR (Regulation (EU) 2016/679) says that “‘personal data’ means any information relating to an identified or identifiable natural person (‘data subject’); an identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person”.

SINGER, N., 2015. With a few bits of data, researchers identify “anonymous” people. New York Times, 29 January.

TOCKAR, A., 2014. Riding with the stars: passenger privacy in the NYC taxicab dataset. Neustar Research, 15 September.