You + Big Data = Not Anonymous; Microsoft develops Differential Privacy for everyone

Microsoft Research is developing Differential Privacy technology that would act as a privacy guard and go-between when researchers query databases. It would ensure that no individual could be re-identified, keeping the people in a database anonymous while still helping researchers sort through big data.

Too often, we hear how researchers can take supposedly anonymized or obfuscated PII (Personally Identifiable Information) and exploit "linkability threats" to re-identify individuals, as in "I know where you are and what you are sharing" and "deanonymizing you after one mouse click on a website." Sites may try to lock down identifying PII, but big data in databases, even if stripped of PII, still poses a significant threat to personal privacy. Twelve years ago, researchers proved that with publicly available information, such as a voter registration database, it is not too difficult to correlate databases and link real identities with allegedly anonymous individuals. On the Trustworthy Computing blog, Microsoft chief privacy officer Brendon Lynch introduced a research technology called Differential Privacy, a technology that "helps address re-identification and other privacy risks as information is gleaned from a given database."
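To make the linkage threat concrete, here is a minimal Python sketch of correlating an "anonymized" dataset with a public one. Everything in it is hypothetical: the records, the names, and the choice of ZIP code, birth year and sex as the shared fields are assumptions for illustration, not details taken from the research mentioned above.

```python
# Hypothetical "anonymized" records: names removed, other fields kept.
anonymized = [
    {"zip": "12345", "birth_year": 1970, "sex": "M", "diagnosis": "condition X"},
    {"zip": "12345", "birth_year": 1985, "sex": "F", "diagnosis": "condition Y"},
]

# Hypothetical public records, such as a voter roll, that do carry names.
voter_roll = [
    {"name": "Bob Example", "zip": "12345", "birth_year": 1970, "sex": "M"},
    {"name": "Alice Example", "zip": "67890", "birth_year": 1985, "sex": "F"},
]

# Fields present in both datasets (an assumption made for this sketch).
SHARED_FIELDS = ("zip", "birth_year", "sex")


def link(anonymized_rows, public_rows):
    """Return (name, diagnosis) pairs where the shared fields match exactly one person."""
    index = {}
    for person in public_rows:
        key = tuple(person[field] for field in SHARED_FIELDS)
        index.setdefault(key, []).append(person["name"])

    matches = []
    for row in anonymized_rows:
        names = index.get(tuple(row[field] for field in SHARED_FIELDS), [])
        if len(names) == 1:  # a unique match re-identifies the record
            matches.append((names[0], row["diagnosis"]))
    return matches


reidentified = link(anonymized, voter_roll)
print(reidentified)
```

Stripping the name column did nothing to protect the first record, because the remaining fields were enough to single out one person in the public data.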

According to Microsoft's research whitepaper [download PDF] titled Differential Privacy for Everyone, "Differential Privacy (DP) was conceived to deal with privacy threats in this context. That is, to prevent unwanted re-identification and other privacy threats to individuals whose personal information is present in large datasets, while providing useful access to data. Under the DP model, personal information in a large database is not modified and released for analysts to use."

When an analyst poses a question to a database, the question goes through an intermediary piece of software that "acts as a privacy-protecting screen or filter, effectively serving as a privacy guard." Depending on the privacy risk the software assigns to the question, a certain amount of "noise" and "distortion" is inserted into the answer so that no individual can be re-identified.

Microsoft's Trustworthy Computing Differential Privacy in action example:


Differential Privacy (DP) in action: Analyst sends a query to an intermediate piece of software, the DP guard. The guard assesses the privacy impact of the query using a special algorithm. The guard sends the query to the database, and gets back a clean answer based on data that has not been distorted in any way. The guard then adds the appropriate amount of "noise," scaled to the privacy impact, thus making the answer (hopefully slightly) imprecise in order to protect the confidentiality of the individuals whose information is in the database, and sends the modified response back to the analyst.
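The steps in that description can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Microsoft's implementation: the `DPGuard` class, its `count` method and the `epsilon` parameter are hypothetical names, and the noise is drawn from a Laplace distribution, a common choice in the differential privacy literature that the whitepaper excerpt quoted here does not specify.

```python
import random


def laplace_noise(scale, rng):
    """Draw from Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


class DPGuard:
    """Hypothetical guard that sits between the analyst and the raw records."""

    def __init__(self, records, epsilon, seed=None):
        self._records = records          # undistorted data, never released directly
        self._epsilon = epsilon          # assumed privacy parameter: smaller means more noise
        self._rng = random.Random(seed)

    def count(self, predicate):
        # Step 1: assess the privacy impact. For a counting query, adding or
        # removing one person changes the true answer by at most 1.
        sensitivity = 1.0
        # Step 2: get the clean answer from the unmodified data.
        clean = sum(1 for record in self._records if predicate(record))
        # Step 3: add noise scaled to the privacy impact and return the result.
        return clean + laplace_noise(sensitivity / self._epsilon, self._rng)


records = [{"flagged": i % 4 == 0} for i in range(200)]   # 50 flagged records
guard = DPGuard(records, epsilon=0.5, seed=1)
noisy_answer = guard.count(lambda record: record["flagged"])
print(f"noisy count: {noisy_answer:.2f} (true count is 50)")
```

The analyst only ever sees the value returned by `count`. The clean answer stays inside the guard.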

Author Javier Salido, perhaps a Superman fan, used Smallville and Bob to illustrate how DP could work. Let's say a hospital "deployed a DP guard for its database that keeps an eye out for privacy." That hospital database includes patients with a potentially life-threatening disease, and a researcher wants to break down the number of individuals with the disease by region. Eight towns have lots of people with the disease, but in "Smallville" only "Bob" has it. If an exact count of one for Smallville lets the researcher single out Bob, then his privacy would be breached.

Salido wrote:

To avoid this situation, the DP guard will introduce a random but small level of inaccuracy, or distortion, into the results it serves to the researcher. Thus, instead of reporting one case for Smallville, the guard may report any number close to one. It could be zero, or ½ (yes, this would be a valid noisy response when using DP), or even -1. The researcher will see this and, knowing the privacy guard is doing its job, she will interpret the result as "Smallville has a very small number of cases, possibly even zero." In fact, and in order to maintain privacy, the guard may also report non-zero (but equally small) numbers for some of the towns that really have zero cases.

The number of people with the disease in the eight other towns would also be slightly tweaked by the DP guard to be larger or smaller. However, "the answers reported by the DP guard are accurate enough that they provide valuable information to the researcher, but inaccurate enough that the researcher cannot know if Bob's name is or is not in the database."
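The Smallville scenario can be tried out with a short Python sketch. Only Smallville's single case comes from the example above. The other town names and counts are made up, and the Laplace noise and the `epsilon` value are assumptions rather than details from the whitepaper.

```python
import random


def laplace_noise(scale, rng):
    """Draw from Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_counts(true_counts, epsilon, rng):
    """Return one noisy count per town.

    Each patient lives in exactly one town, so adding or removing a patient
    changes a single count by 1. Noise with scale 1/epsilon is added to every
    town, including towns whose true count is zero.
    """
    return {
        town: count + laplace_noise(1.0 / epsilon, rng)
        for town, count in true_counts.items()
    }


# Hypothetical data: only the Smallville figure is taken from the example.
true_counts = {"Smallville": 1, "Town A": 0, "Town B": 120, "Town C": 95}

rng = random.Random(42)
reported = noisy_counts(true_counts, epsilon=1.0, rng=rng)
for town, value in reported.items():
    print(f"{town}: {value:.1f}")
```

Under these assumptions the noise has the same typical size for every town, so it barely affects the large counts but is enough to blur the difference between zero cases and one case in a small town.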

The DP guard would also track the cumulative privacy cost of all the questions that have been asked. If the cost reaches the "privacy budget" that was assigned to the database, then it would "raise the alarm and a policy decision can be made by the entity that controls the data, on whether the amount of distortion introduced to answers needs to be increased, or whether the risk is worth the reward." Besides keeping track of privacy budget costs, the "DP guard works as a helpdesk." It could also "avoid scenarios in which the answers to different questions can be combined by multiple analysts in such a way that any individual's privacy is breached eventually."
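The budget bookkeeping might look something like the Python sketch below. It assumes that each query has a numeric privacy cost and that costs simply add up across queries, which is the basic composition rule in the differential privacy literature. The `BudgetTracker` class and its `on_alarm` callback are hypothetical names, not part of any Microsoft software.

```python
class BudgetTracker:
    """Hypothetical tracker for the cumulative privacy cost of answered queries."""

    def __init__(self, total_budget, on_alarm):
        self.total_budget = total_budget   # privacy budget assigned to the database
        self.spent = 0.0                   # cumulative cost of all queries so far
        self._on_alarm = on_alarm          # called once the budget has been reached

    def charge(self, cost):
        """Record the cost of one query and raise the alarm if the budget is reached."""
        self.spent += cost
        if self.spent >= self.total_budget:
            # The data owner decides what happens next: add more distortion
            # to future answers, or accept the additional risk.
            self._on_alarm(self.spent, self.total_budget)

    @property
    def remaining(self):
        return max(self.total_budget - self.spent, 0.0)


alarms = []
tracker = BudgetTracker(total_budget=1.0, on_alarm=lambda spent, budget: alarms.append(spent))

for _ in range(4):
    tracker.charge(0.25)   # four queries, each with an assumed cost of 0.25

print(tracker.spent, tracker.remaining, alarms)
```

With these numbers the alarm fires on the fourth query, when the cumulative cost reaches the budget of 1.0.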

DP technology is still at a research level. "Microsoft believes that in order for society to reap the full benefits offered by the data age and the creative efforts of researchers and developers, without significantly eroding individual privacy, we will have to address a variety of different needs and requirements. For some use cases, leveraging new and innovative privacy-protecting technologies like Differential Privacy will help meet those requirements."

So what do you think of DP tech? It sounds like a promising way to help protect privacy. Additional information about Differential Privacy can be found at Database Privacy and Privacy Integrated Queries.

Follow me on Twitter @PrivacyFanatic
