Demonstrating just how challenging it is to keep private data secure, researchers from Imperial College London have presented an attack on a new data privacy system called Diffix, whose breakthrough technology has recently been commercialized and approved by the French data protection authority CNIL to satisfy the full criteria for GDPR (General Data Protection Regulation)—the new EU data protection law set to go into effect at the end of May.
But now the researchers have shown that, using only five attributes of an individual and asking 10 carefully chosen questions to the query-based system, it's possible to determine the individual's private attributes with up to 99% accuracy.
The researchers, Andrea Gadotti, Florimond Houssiau, Luc Rocher, and Yves-Alexandre de Montjoye from the departments of computing and data science at Imperial College London, have written a paper on their noise-exploitation attack on Diffix, which was developed by the German start-up Aircloak. Diffix uses technology pioneered by researchers at the Max Planck Institute over several years.
As the researchers at Imperial College London explain, the purpose of their attack is to highlight the need for both the full transparency of all new data privacy systems, along with a community that has full access to the techniques in order to detect and discuss potential vulnerabilities.
"We need to accept that no system is perfect," the researchers write in their blog at the Computational Privacy Group. "There will be attacks, and some of them will succeed. We need to prepare for this and learn from best practices in security: ensuring that several layers of security exist, to not have all the data in one place (what Jean-Pierre Hubaux calls the Fort Knox approach), etc. We also need standards and systems to be completely transparent and open. Building secure systems requires anyone to be able to review the code without technical or legal barriers, propose solutions and build upon existing work."
Data privacy protection systems are rapidly evolving, with traditional approaches based on data anonymization now rendered obsolete, due in part to the large sizes and uses of modern datasets. Alternatives to data anonymization, such as query-based systems like the one used by Diffix, provide a promising new solution. In these systems, users can ask questions about the data system and receive aggregate results as answers. This method aims to find a balance between using the data for useful purposes, such as performing research and fighting disease, while maintaining individual privacy.
In Diffix, the privacy protection comes in large part from the addition of noise. Somewhat ironically, the researchers exploit this noise in their attack in order to infer private information of individuals in the database. To do this, they implement an intersection attack, in which they pose two queries that ask for the number of individuals in the dataset who meet certain conditions. The two queries differ by only one condition, so that by calculating the difference between the results it's possible to cancel out part of the noise. The researchers showed that, by obtaining five pairs of these results, and then performing a statistical test (the likelihood ratio test), it's possible to determine certain information about an individual—for example, whether or not the person has HIV.
Although the researchers point out a few limitations of their attack, they believe that it exposes a serious vulnerability of the system, and provides a reminder that privacy protection will continue to be an ongoing challenge.