If you think a bogus name or a fake age on a profile makes that account harder to link to your profiles on other sites, think again: researchers have worked out how to use location data to link users across domains.
You also should not be comforted when you learn that big data has been stripped of names and personal details; researchers say it is “no guarantee of privacy.”
Columbia University computer science researchers Chris Riederer, Yunsung Kim and Augustin Chaintreau, along with Google researchers Nitish Korula and Silvio Lattanzi, combined their considerable brain power to come up with an algorithm that needs only location data from two apps to identify someone. The researchers recently presented their paper, “Linking Users Across Domains with Location Data: Theory and Validation” (pdf), at the 25th International World Wide Web Conference.
“Many people choose not to identify themselves online,” said Chaintreau. Even so, you are often leaking your location without knowing it. “If I now tell you that your location data makes you recognizable across all of your accounts, how does that change your behavior? This is a question we now have to answer.”
In case you don’t know, the researchers explained that “increasingly often” your location “is captured and recorded for a majority of mobile apps even in the absence of geographical personalization. This considerably expands the number of parties who can collect and exploit the knowledge of a user’s whereabouts. Even when data is recorded sporadically, these datasets are very rich and intimately connected to one’s everyday life; they may present or at least partially reflect our most recognizable patterns.”
Then there’s the dreaded metadata. The Columbia University Data Science Institute explained that location metadata is regarded as so distinctive that “most people can be identified from a few data points within a single data set. With as little as four credit card purchases, individual shoppers can be picked out from among millions of other credit card users.”
The researchers showed that “individuals can be identified with a high degree of confidence by matching their movements across two data sets.” They tested this on three pairs of datasets: Foursquare check-ins matched against geolocated tweets, tweets matched against Instagram posts, and a log of phone calls matched against credit card transactions.
The authors pointed out the main ingredient of their algorithm is “a new use of misses and repetitions to interpret coincidental records that exploits the sparse property of coupling between Poisson processes.” It is denoted as POIS in the graphic.
The researchers compared their algorithm with “three state-of-the-art reconciliation techniques.” They discussed a variation of the “Netflix Attack” used to de-anonymize Netflix users; it exploits “sparsity, using unique, rare occurrences in two datasets to link users.” In the graphic, it is denoted as “NFLIX.”
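To make the sparsity idea concrete, here is a minimal sketch of rarity-weighted linking. It is not the code from the paper or from the original Netflix attack: the function names, the toy place labels and the 1/log(1 + n) weighting are all assumptions chosen for illustration, and timestamps are ignored entirely.

```python
import math
from collections import Counter

def rarity_weights(visits_by_user):
    """Weight each place by how few users appear there (rarer = more telling)."""
    support = Counter()
    for places in visits_by_user.values():
        support.update(set(places))
    # Assumed weighting: 1 / log(1 + n) stays finite when only one user visits a place.
    return {place: 1.0 / math.log(1 + n) for place, n in support.items()}

def link_by_rarity(target_places, candidates):
    """Return the candidate with the highest rarity-weighted overlap, plus all scores."""
    weights = rarity_weights(candidates)
    scores = {
        user: sum(weights[p] for p in set(places) & set(target_places))
        for user, places in candidates.items()
    }
    return max(scores, key=scores.get), scores

# Hypothetical toy data: place labels stand in for location bins.
dataset_b = {
    "b1": ["cafe", "gym", "office"],
    "b2": ["cafe", "office", "pier"],
    "b3": ["cafe", "office", "park"],
}
best, scores = link_by_rarity(["cafe", "pier"], dataset_b)
print(best)  # b2: everyone visits the cafe, but only b2 was seen at the pier
```

In this toy example the shared cafe visit says little because all three candidates have it, while the single pier visit decides the match.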
The frequency-based technique “approximates the likelihood of visits made in one domain by the frequency of visits for that user in the other domain.” It is referred to as WYCI in the graphic.
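A rough sketch of the frequency-based idea might look like the following. It assumes each visit is an independent draw from the candidate's observed place frequencies, with add-one smoothing so unseen places do not zero out the likelihood. The names and toy data are hypothetical, and the published technique may differ in its details.

```python
import math
from collections import Counter

def log_likelihood(visits, candidate_visits, all_places, alpha=1.0):
    """Log-likelihood of `visits` under the candidate's smoothed visit frequencies."""
    counts = Counter(candidate_visits)
    # Additive smoothing (alpha is an assumed parameter) gives unseen places nonzero mass.
    total = len(candidate_visits) + alpha * len(all_places)
    return sum(math.log((counts[p] + alpha) / total) for p in visits)

def link_by_frequency(visits, candidates):
    """Pick the candidate in the other domain that makes `visits` most likely."""
    all_places = {p for v in candidates.values() for p in v} | set(visits)
    scores = {u: log_likelihood(visits, v, all_places) for u, v in candidates.items()}
    return max(scores, key=scores.get), scores

# Hypothetical toy data for the second domain.
domain_b = {
    "b1": ["home", "home", "gym", "office"],
    "b2": ["bar", "bar", "office", "pier"],
}
best, scores = link_by_frequency(["home", "gym", "home"], domain_b)
print(best)  # b1
```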
The authors also discussed histogram matching, an algorithm that “relies on the density of data, assuming that over time—even in different periods—a unique histogram of user visits will emerge from a user’s behavior.” It is denoted as HIST in the graphic.
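Histogram matching can be illustrated in a few lines. The sketch below normalizes each user's visit counts into a histogram over places and links the target to the candidate with the smallest L1 distance. The distance measure, function names and toy data are assumptions for illustration, not the measure used in the work the authors compared against.

```python
from collections import Counter

def histogram(visits, places):
    """Normalized visit histogram over a fixed, ordered list of places (visits must be non-empty)."""
    counts = Counter(visits)
    total = len(visits)
    return [counts[p] / total for p in places]

def l1_distance(h1, h2):
    """Sum of absolute differences between two histograms of equal length."""
    return sum(abs(a - b) for a, b in zip(h1, h2))

def link_by_histogram(visits, candidates):
    """Pick the candidate whose visit histogram is closest to the target's."""
    places = sorted({p for v in candidates.values() for p in v} | set(visits))
    target = histogram(visits, places)
    distances = {
        u: l1_distance(target, histogram(v, places)) for u, v in candidates.items()
    }
    return min(distances, key=distances.get), distances

# Hypothetical toy data: the two domains may cover different time periods.
domain_b = {
    "b1": ["home"] * 6 + ["gym"] * 3 + ["office"],
    "b2": ["home"] * 2 + ["bar"] * 5 + ["office"] * 3,
}
best, distances = link_by_histogram(["home", "home", "home", "gym", "office"], domain_b)
print(best)  # b1
```

Because only the proportions matter, the two records never need to overlap in time, which is the property the quoted description emphasizes.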
“What this really shows is that simply removing identifying information from large-scale data sets is not sufficient,” MIT Media Lab research scientist Yves-Alexandre de Montjoye told Columbia University. “We need to move to a model of privacy-through-security. Instead of anonymizing data and making it public, there should be technical controls over who gets access to the data, how it is used and for what purpose.”
Math geeks who care about privacy might really get into this paper. In the end, the authors concluded:
User data is constantly multiplying across an increasing array of websites, apps and services, as they are eager to share part of their behavior with service providers to receive personalized (and free) services. Users may attempt to deal with the privacy implications through partially or inaccurately filled profile information (such as entering a fake name, age, etc.), or using the privacy settings to “lock down” access. However, such methods are of limited use because commonly collected fields (such as location) that are integral to the service provided may in themselves be sufficient to link this account with other accounts of the same user.
You Are Where You Go tool
In a separate project, Columbia’s Riederer helped develop a “You Are Where You Go” tool so you can check out your digital footprint by linking location-tracking apps such as Foursquare, Instagram and Twitter, as well as manually adding locations important to you. The tool also provides interesting tidbits about where you’ve been, such as the racial composition of the area and whether it is a high-income or low-income location. As Columbia University put it, “A few simple algorithms process this information to make relatively accurate inferences about your age, ethnicity, income and whether you have kids.”
“People are now sharing their location on a growing number of apps, often without realizing it,” Riederer stated. “Companies no longer have to be very sophisticated to access this data and use it for their own purposes.”