Matteo Giomi, Franziska Boenisch, Christoph Wehmeyer, Borbála Tasnádi
{"title":"A Unified Framework for Quantifying Privacy Risk in Synthetic Data","authors":"Matteo Giomi, Franziska Boenisch, Christoph Wehmeyer, Borbála Tasnádi","doi":"10.56553/popets-2023-0055","DOIUrl":null,"url":null,"abstract":"Synthetic data is often presented as a method for sharing sensitive information in a privacy-preserving manner by reproducing the global statistical properties of the original data without dis closing sensitive information about any individual. In practice, as with other anonymization methods, synthetic data cannot entirely eliminate privacy risks. These residual privacy risks need instead to be ex-post uncovered and assessed. However, quantifying the actual privacy risks of any synthetic dataset is a hard task, given the multitude of facets of data privacy. We present Anonymeter, a statistical framework to jointly quantify different types of privacy risks in synthetic tabular datasets. We equip this framework with attack-based evaluations for the singling out, linkability, and inference risks, which are the three key indicators of factual anonymization according to data protection regulations, such as the European General Data Protection Regulation (GDPR). To the best of our knowledge, we are the first to introduce a coherent and legally aligned evaluation of these three privacy risks for synthetic data, as well as to design privacy attacks which model directly the singling out and linkability risks. We demonstrate the effectiveness of our methods by conducting an extensive set of experiments that measure the privacy risks of data with deliberately inserted privacy leakages, and of synthetic data generated with and without differential privacy. Our results highlight that the three privacy risks reported by our framework scale linearly with the amount of privacy leakage in the data. Furthermore, we observe that synthetic data exhibits the lowest vulnerability against linkability, indicating one-to-one relationships between real and synthetic data records are not preserved. Finally, with a quantitative comparison we demonstrate that Anonymeter outperforms existing synthetic data privacy evaluation frameworks both in terms of detecting privacy leaks, as well as computation speed. To contribute to a privacy-conscious usage of synthetic data, we publish Anonymeter as an open-source library (https://github.com/statice/anonymeter).","PeriodicalId":74556,"journal":{"name":"Proceedings on Privacy Enhancing Technologies. Privacy Enhancing Technologies Symposium","volume":"37 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings on Privacy Enhancing Technologies. Privacy Enhancing Technologies Symposium","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.56553/popets-2023-0055","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 2
Abstract
Synthetic data is often presented as a method for sharing sensitive information in a privacy-preserving manner by reproducing the global statistical properties of the original data without dis closing sensitive information about any individual. In practice, as with other anonymization methods, synthetic data cannot entirely eliminate privacy risks. These residual privacy risks need instead to be ex-post uncovered and assessed. However, quantifying the actual privacy risks of any synthetic dataset is a hard task, given the multitude of facets of data privacy. We present Anonymeter, a statistical framework to jointly quantify different types of privacy risks in synthetic tabular datasets. We equip this framework with attack-based evaluations for the singling out, linkability, and inference risks, which are the three key indicators of factual anonymization according to data protection regulations, such as the European General Data Protection Regulation (GDPR). To the best of our knowledge, we are the first to introduce a coherent and legally aligned evaluation of these three privacy risks for synthetic data, as well as to design privacy attacks which model directly the singling out and linkability risks. We demonstrate the effectiveness of our methods by conducting an extensive set of experiments that measure the privacy risks of data with deliberately inserted privacy leakages, and of synthetic data generated with and without differential privacy. Our results highlight that the three privacy risks reported by our framework scale linearly with the amount of privacy leakage in the data. Furthermore, we observe that synthetic data exhibits the lowest vulnerability against linkability, indicating one-to-one relationships between real and synthetic data records are not preserved. Finally, with a quantitative comparison we demonstrate that Anonymeter outperforms existing synthetic data privacy evaluation frameworks both in terms of detecting privacy leaks, as well as computation speed. To contribute to a privacy-conscious usage of synthetic data, we publish Anonymeter as an open-source library (https://github.com/statice/anonymeter).