REIDENT

Assessment of Re-identification Risks with Bayesian Probabilistic Programming

REIDENT people. From left to right: Andrzej Wasowski, Christian Probst, Willard Rafnsson and Raúl Pardo.


Project Description

Sharing data is the foundation of open science: it improves the efficiency and reproducibility of research and provides fertile ground for a vibrant new data-driven economy. Data analysis is boosting research in medicine, artificial intelligence and many other domains. Moreover, it is becoming a key factor in the economy of big companies such as Google or Facebook.

Yet, whatever value sharing data creates, we cannot ignore the privacy risks it poses. Many examples show that even in pseudo-anonymized datasets it is easy to re-identify concrete individuals. Differential privacy has become the state-of-the-art method for giving a form of re-identification protection to the individuals in a dataset. Unfortunately, it relies on a very indirect mathematical definition, which hinders data sharing, as neither scientists, lawmakers, nor lay users can understand the guarantees differential privacy provides. This lack of understanding may result in undesirable disclosure of personal data.

In the REIDENT project, we explore the use of Bayesian inference to assess potential privacy risks when sharing data. Bayesian inference has been widely studied and is well understood; it thus avoids the problem of relying on a cumbersome notion of privacy such as differential privacy. Furthermore, Bayesian inference enjoys broad tool support: there are plenty of probabilistic programming frameworks (for instance, PyMC3, Figaro, Theano, Edward, Anglican and HackPPL) that can be used as a basis for building automatic privacy risk analysis tools. In a nutshell, our approach provides two advantages which current techniques lack:

  1. An explainable notion of privacy, which directly supports the requirements of privacy regulations such as the General Data Protection Regulation (GDPR), thus enabling sustainable development of the data-driven economy.
  2. A method for data scientists to explore risks of re-identification before sharing data or results.
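
To make the idea concrete, the following is a minimal sketch of how re-identification risk can be phrased as a Bayesian posterior. It is an illustration only, not the project's tooling: the population, the attribute names and the helper functions `posterior_identity` and `reidentification_risk` are all hypothetical, and the attacker is assumed to hold a uniform prior over a small, fully known population. The posterior is computed by exact enumeration rather than with a probabilistic programming framework.

```python
from fractions import Fraction

# Hypothetical background knowledge of the attacker:
# (name, zip_code, birth_year) for everyone who could be in the dataset.
POPULATION = [
    ("alice", "2300", 1980),
    ("bob", "2300", 1980),
    ("carol", "2300", 1975),
    ("dave", "2450", 1980),
]


def posterior_identity(population, released_zip, released_year):
    """Posterior over who a released pseudo-anonymized record belongs to.

    Prior: uniform over the population (an assumption of this sketch).
    Likelihood: 1 if a person's attributes match the released record, else 0.
    """
    prior = Fraction(1, len(population))
    unnormalised = {
        name: prior * (1 if (zip_code, year) == (released_zip, released_year) else 0)
        for name, zip_code, year in population
    }
    evidence = sum(unnormalised.values())
    if evidence == 0:
        raise ValueError("released record matches nobody in the population")
    # Bayes' rule: posterior = prior * likelihood / evidence
    return {name: weight / evidence for name, weight in unnormalised.items()}


def reidentification_risk(posterior):
    """Probability that the attacker's best guess is correct."""
    return max(posterior.values())


# A record released with the name removed but zip code and birth year intact.
posterior = posterior_identity(POPULATION, "2300", 1980)
risk = reidentification_risk(posterior)
print(posterior)
print("re-identification risk:", risk)
```

The resulting number reads directly as "the probability that an attacker with this background knowledge guesses the right person", which is the kind of explainable statement the first advantage refers to. A data scientist could rerun the same computation with coarser released attributes to compare risks before sharing.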

If successful, this project will not only help to better assess the privacy risks associated with data sharing, but also open an exciting research agenda on using probabilistic programming and statistics for program analysis.

Participants
Andrzej Wasowski, Full Professor (PI)
Willard Rafnsson, Assistant Professor
Christian Probst, Full Professor
Raúl Pardo, Postdoc

Contact: Andrzej Wasowski (wasowski@itu.dk)

Acknowledgements

This research project is funded by the Villum Foundation (September 2018 to August 2021).