Deferring to differential privacy


The term anonymize is not only an awkward verb; the process is equally bad at doing what it purports to do, which is make data not personally identifiable. Foiling early efforts at anonymizing data has proved fairly simple, putting at risk a company's freedom to use that data for intelligence purposes or to monetize it.


The Simons Foundation recently cited an example from Massachusetts, where the state made health records available to researchers after removing personally identifiable fields such as name, address and Social Security number, only to have an MIT grad student re-identify the data by cross-referencing it with other public records.


Big data analysts rely in part on the anonymization of data to gain access to the data sets they use as the foundation for their work, and to avoid confrontation with regulators, privacy advocates and consumers.


Simons Science News used the example to introduce a potentially new approach to safeguarding data. It explains how a mathematical technique called "differential privacy" may give researchers access to stores of personal data without threatening privacy. Differential privacy "allows researchers to ask practically any question about a database of sensitive information and provides answers that have been 'blurred' so that they reveal virtually nothing about any individual's data, not even whether the individual was in the database in the first place," the article said.
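The article does not show how the "blurring" works, but the standard mechanism behind differential privacy is adding calibrated random noise to query answers. A minimal sketch, not taken from the article: for a counting query, adding Laplace noise with scale 1/epsilon is enough, because adding or removing any one person changes the true count by at most 1. The function names here are illustrative.

```python
import random

def laplace_noise(scale: float) -> float:
    # Laplace(0, scale) sampled as the difference of two
    # independent exponentials with the same mean.
    return random.expovariate(1 / scale) - random.expovariate(1 / scale)

def private_count(records, predicate, epsilon: float) -> float:
    # A counting query has sensitivity 1: one person's presence or
    # absence changes the true count by at most 1. Noise with scale
    # 1/epsilon therefore gives epsilon-differential privacy.
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon)
```

Smaller epsilon means more noise and stronger privacy; the analyst sees only the noisy count, never the exact one.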


The concept was first introduced by Cynthia Dwork and Frank McSherry of Microsoft (NASDAQ: MSFT) Research Silicon Valley in 2005, along with Kobbi Nissim of Israel's Ben-Gurion University and Adam Smith of Pennsylvania State University. The technique preserves plausible deniability by using fictitious stand-ins for real people.
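One classic way to see the plausible-deniability idea in action, though the article does not describe this specific mechanism, is randomized response: each person answers a sensitive yes/no question truthfully only half the time and otherwise answers at random, so any individual "yes" can be blamed on the coin. The aggregate rate is still recoverable. A hedged sketch with illustrative names:

```python
import random

def randomized_response(truth: bool) -> bool:
    # Flip a coin: heads, answer truthfully; tails, answer with a
    # second coin flip. Any single answer could be the coin's doing,
    # which is the respondent's plausible deniability.
    if random.random() < 0.5:
        return truth
    return random.random() < 0.5

def estimate_rate(answers) -> float:
    # Each respondent says "yes" with probability p/2 + 1/4, where p
    # is the true rate, so invert: p ~= 2 * observed_fraction - 1/2.
    observed = sum(answers) / len(answers)
    return 2 * observed - 0.5
```

No individual record is trustworthy on its own, yet the population-level statistic converges to the truth as the sample grows.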


Much of this work is technical and difficult for non-experts, so researchers are starting to build standardized computer languages that would let non-experts release sensitive data in a differentially private way by writing a simple program, the article said. The Census Bureau is already using the technology.


MIT researchers still say that releasing only aggregate information about large groups of people remains susceptible to privacy breaches, citing a 2008 case in which a research team demonstrated the dangers of releasing aggregate statistics from genome-wide association studies.


For more information on differential privacy:
- see the Simons Science News post

Related Articles:
The secret side of big data
Researchers demo free online technique to perform computation, store data
Big data still scaring security experts