MIT researchers show you can be identified by just a few data points

Four samples from 30 days of credit card transaction data covering 1.1 million people are enough to identify 90% of them


If you thought sparse personal metadata - random chunks of data about data - was hard to mine for the purposes of identifying individuals, think again. MIT researchers have just shown that it takes only four data points (the dates and locations of purchases) from a 30-day database of credit card purchases by 1.1 million people to identify 90 percent of them.

Metadata has been in the news extensively over the last couple of years, mainly because of the Snowden revelations about NSA spying activities. But it's not just the government that wants to know who’s doing what, when, and why; every large commercial corporation wants the same insights into people’s behavior. Instead of (ostensibly) protecting us from terrorists, as the government claims to be doing, the corporations want to figure out better ways to part us from our money. Being able to chart consumers' economic perambulations across the commercial landscape is key to being competitive, and the actionable insights gained from in-depth, accurate, and timely consumer surveillance can be the difference between a good quarter and a bad quarter.

The real concern is how easy tracking consumers has become. An MIT News article, “Privacy challenges,” reported (the emphasis is mine):

The data set the researchers analyzed included the names and locations of the shops at which purchases took place, the days on which they took place, and the purchase amounts. Purchases made with the same credit card were all tagged with the same random identification number.

For each identification number — each customer in the data set — the researchers selected purchases at random, then determined how many other customers’ purchase histories contained the same data points. In separate analyses, the researchers varied the number of data points per customer from two to five. Without price information, two data points were still sufficient to identify more than 40 percent of the people in the data set. At the other extreme, five points with price information was enough to identify almost everyone.

That’s pretty amazing and, at first blush, you might think that reducing the amount of data and its quality would improve privacy, but you, my friend, would be wrong:

When the researchers also considered coarse-grained information about the prices of purchases, just three data points were enough to identify an even larger percentage of people in the data set. That means that someone with copies of just three of your recent receipts — or one receipt, one Instagram photo of you having coffee with friends, and one tweet about the phone you just bought — would have a 94 percent chance of extracting your credit card records from those of a million other people. This is true, the researchers say, even in cases where no one in the data set is identified by name, address, credit card number, or anything else that we typically think of as personal information.
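To make the matching procedure in the quoted passages concrete, here is a minimal Python sketch. It is not the researchers' code: the function name `unicity`, the toy `histories` data, and the choice to represent each purchase as a (day, shop) pair are all assumptions made for illustration. For each customer it picks k of their purchases at random, counts how many customers' histories contain all of those purchases, and reports the fraction of customers for whom the only match is themselves.

```python
import random


def unicity(histories, k, seed=0):
    """Fraction of customers singled out by k of their own purchases.

    histories: dict mapping an anonymous customer ID to a set of
    (day, shop) tuples. This layout is an assumption for illustration,
    not the format of the researchers' data set.
    """
    rng = random.Random(seed)
    eligible = 0
    unique = 0
    for purchases in histories.values():
        if len(purchases) < k:
            continue  # not enough purchases to draw k points from
        eligible += 1
        # Pick k of this customer's purchases at random.
        sample = set(rng.sample(sorted(purchases), k))
        # Count every history that contains all k points (the customer's
        # own history always matches, so the count is at least 1).
        matches = sum(1 for other in histories.values() if sample <= other)
        if matches == 1:
            unique += 1
    return unique / eligible if eligible else 0.0


# Hypothetical toy data: three customers, three purchases each.
histories = {
    "id-1": {(1, "bakery"), (2, "cafe"), (3, "gym")},
    "id-2": {(1, "bakery"), (2, "cafe"), (4, "bookshop")},
    "id-3": {(1, "bakery"), (5, "cinema"), (6, "cafe")},
}

print(unicity(histories, k=1))
print(unicity(histories, k=3))
```

In this toy data every customer shares the bakery visit on day 1, so a single point may or may not single someone out, while all three points always do. Adding a coarse price bucket as a third element of each tuple would be one way to model the price information mentioned in the quote.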

The bottom line is that with enough computing power and advanced algorithms, it appears to be ridiculously easy for anyone who isn’t living in a cave to be tracked, analyzed, and targeted. It’s not the activities of the NSA, the CIA, the FBI, or any other TLA (three-letter agency) we should be worried about; it’s the scientists working on Big Data and the corporations that have a huge incentive to use this technology to pigeonhole us so they can sell to us.


Copyright © 2015 IDG Communications, Inc.
