A method that can estimate whether a person can be re-identified from an incomplete, anonymized dataset is presented in Nature Communications. The paper suggests that current methods of anonymization and data sharing may be inadequate to protect individual privacy or satisfy requirements set by data protection laws, such as the European General Data Protection Regulation.
Data science and artificial intelligence promise to revolutionize many aspects of our lives, including medicine and health care, business and governance. These methods depend on large-scale, detailed, individual-level data, the collection and sharing of which have raised concerns about individual privacy. Anonymization and the release of partial datasets have been used to address these concerns. However, recent successful re-identifications of anonymized datasets, including browsing histories and mobile phone and credit card data, have shown that these practices may be inadequate.
Yves-Alexandre de Montjoye and colleagues created a statistical method that accurately estimates the likelihood that an individual can be correctly re-identified in any anonymized dataset. The authors found that knowing only a few attributes, such as post code, date of birth, gender and number of children, is often sufficient to re-identify individuals with high confidence, even if the dataset is incomplete. The likelihood of identification rises quickly with the number of known attributes. For example, 99.98% of people in Massachusetts would be identifiable based on 15 demographic attributes. Releasing only a sampled or partial dataset is therefore not sufficient to protect individual privacy, they conclude.
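
The link between the number of known attributes and identifiability can be illustrated with a small Python sketch. This is not the authors' statistical method, which estimates re-identification likelihood from incomplete data. It simply counts how many records in a fully known, synthetic population are unique on a given set of attributes. The function name `fraction_unique`, the attribute names and all value ranges are assumptions made for illustration only.

```python
import random
from collections import Counter


def fraction_unique(records, attributes):
    """Return the share of records whose values on `attributes` occur exactly once."""
    keys = [tuple(r[a] for a in attributes) for r in records]
    counts = Counter(keys)
    return sum(1 for k in keys if counts[k] == 1) / len(records)


# Synthetic population: every field and value range here is invented for illustration.
rng = random.Random(0)
population = [
    {
        "gender": rng.choice("FM"),
        "children": rng.randrange(0, 5),
        "post_code": rng.randrange(50),
        "birth_year": rng.randrange(1940, 2005),
        "birth_day": rng.randrange(1, 366),
    }
    for _ in range(10_000)
]

# Reveal one more attribute at each step and measure how many people become unique.
attrs = ["gender", "children", "post_code", "birth_year", "birth_day"]
uniqueness = [
    fraction_unique(population, attrs[:k]) for k in range(1, len(attrs) + 1)
]

for k, share in enumerate(uniqueness, start=1):
    print(f"{k} attribute(s) {attrs[:k]}: {share:.3f} of records are unique")
```

Because a record that is unique on some attributes stays unique when more attributes are added, the printed share can never decrease from one line to the next. Real demographic data are correlated and far from uniform, so the actual figures would differ from this toy example.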