Researchers at Imperial College London published a paper in Nature Communications on Tuesday showing how inadequate current techniques for anonymizing datasets are. Before a company shares a dataset, it typically removes identifying information such as names and email addresses, but the researchers were able to game this system.
Using a machine learning model and datasets that included up to 15 identifying characteristics (such as age, gender, and marital status), the researchers were able to accurately reidentify 99.98 percent of Americans in an anonymized dataset, according to the study. For their analyses, the researchers used 210 different datasets, gathered from five sources including the U.S. government, that featured information on more than 11 million individuals. Specifically, the researchers define their findings as a successful effort to propose and validate “a statistical model to quantify the likelihood for a re-identification attempt to be successful, even if the disclosed dataset is heavily incomplete.”
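To see why a handful of ordinary attributes can single people out, consider the rough sketch below. It is not the researchers' statistical model; it only counts how many records in a made-up population become unique as more attributes are combined. The function names, the attribute names (including `zip3`), and the randomly generated population are all assumptions chosen for illustration.

```python
import random
from collections import Counter


def unique_fraction(records, attributes):
    """Share of records whose combination of the given attributes is unique."""
    keys = [tuple(record[a] for a in attributes) for record in records]
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1) / len(records)


def make_toy_population(n, seed=0):
    """Build a synthetic population (hypothetical values, not real data)."""
    rng = random.Random(seed)
    return [
        {
            "age": rng.randint(18, 90),
            "gender": rng.choice(["F", "M"]),
            "marital_status": rng.choice(
                ["single", "married", "divorced", "widowed"]
            ),
            "zip3": rng.randint(0, 999),  # assumed coarse location code
        }
        for _ in range(n)
    ]


population = make_toy_population(10_000)
all_attributes = ["age", "gender", "marital_status", "zip3"]

# Add one attribute at a time and report how many records become unique.
for k in range(1, len(all_attributes) + 1):
    attrs = all_attributes[:k]
    print(attrs, round(unique_fraction(population, attrs), 3))
```

Each added attribute can only keep the unique share the same or raise it, which is the intuition behind the study's warning: removing names does little if the remaining columns, taken together, still point to one person.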
[…]Even the hypothetical illustrated by the researchers in the study isn’t a distant fiction. In June of this year, a patient at the University of Chicago Medical Center filed a class-action lawsuit against both the private research university and Google, alleging that the former shared his data with the latter without his consent. The medical center allegedly de-identified the dataset, but still gave Google records with patients’ height, weight, vital signs, information on diseases they have, medical procedures they’ve undergone, medications they are on, and date stamps. The complaint argued that, aside from the breach of privacy in sharing intimate data without a patient’s consent, even if the data was in some way anonymized, the tools available to a powerful tech corporation make it pretty easy to reverse engineer that information and identify a patient.
“Companies and governments have downplayed the risk of re-identification by arguing that the datasets they sell are always incomplete,” de Montjoye said in a statement. “Our findings contradict this and demonstrate that an attacker could easily and accurately estimate the likelihood that the record they found belongs to the person they are looking for.”