Does Differentially Private Synthetic Data Lead to Synthetic Discoveries?

IF 1.3 4区医学 Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS

Methods of Information in Medicine Pub Date : 2024-05-01 Epub Date: 2024-08-13 DOI:10.1055/a-2385-1355

Ileana Montoya Perez, Parisa Movahedi, Valtteri Nieminen, Antti Airola, Tapio Pahikkala

{"title":"Does Differentially Private Synthetic Data Lead to Synthetic Discoveries?","authors":"Ileana Montoya Perez, Parisa Movahedi, Valtteri Nieminen, Antti Airola, Tapio Pahikkala","doi":"10.1055/a-2385-1355","DOIUrl":null,"url":null,"abstract":"Background: Synthetic data have been proposed as a solution for sharing anonymized versions of sensitive biomedical datasets. Ideally, synthetic data should preserve the structure and statistical properties of the original data, while protecting the privacy of the individual subjects. Differential Privacy (DP) is currently considered the gold standard approach for balancing this trade-off.Objectives: The aim of this study is to investigate how trustworthy are group differences discovered by independent sample tests from DP-synthetic data. The evaluation is carried out in terms of the tests' Type I and Type II errors. With the former, we can quantify the tests' validity, i.e., whether the probability of false discoveries is indeed below the significance level, and the latter indicates the tests' power in making real discoveries.Methods: We evaluate the Mann-Whitney U test, Student's t-test, chi-squared test, and median test on DP-synthetic data. The private synthetic datasets are generated from real-world data, including a prostate cancer dataset (n = 500) and a cardiovascular dataset (n = 70,000), as well as on bivariate and multivariate simulated data. Five different DP-synthetic data generation methods are evaluated, including two basic DP histogram release methods and MWEM, Private-PGM, and DP GAN algorithms.Conclusion: A large portion of the evaluation results expressed dramatically inflated Type I errors, especially at levels of ϵ ≤ 1. This result calls for caution when releasing and analyzing DP-synthetic data: low p-values may be obtained in statistical tests simply as a byproduct of the noise added to protect privacy. A DP Smoothed Histogram-based synthetic data generation method was shown to produce valid Type I error for all privacy levels tested but required a large original dataset size and a modest privacy budget (ϵ ≥ 5) in order to have reasonable Type II error levels.","PeriodicalId":49822,"journal":{"name":"Methods of Information in Medicine","volume":" ","pages":"35-51"},"PeriodicalIF":1.3000,"publicationDate":"2024-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11495942/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Methods of Information in Medicine","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1055/a-2385-1355","RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2024/8/13 0:00:00","PubModel":"Epub","JCR":"Q3","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}

引用次数: 0

Abstract

Background: Synthetic data have been proposed as a solution for sharing anonymized versions of sensitive biomedical datasets. Ideally, synthetic data should preserve the structure and statistical properties of the original data, while protecting the privacy of the individual subjects. Differential Privacy (DP) is currently considered the gold standard approach for balancing this trade-off.

Objectives: The aim of this study is to investigate how trustworthy are group differences discovered by independent sample tests from DP-synthetic data. The evaluation is carried out in terms of the tests' Type I and Type II errors. With the former, we can quantify the tests' validity, i.e., whether the probability of false discoveries is indeed below the significance level, and the latter indicates the tests' power in making real discoveries.

Methods: We evaluate the Mann-Whitney U test, Student's t-test, chi-squared test, and median test on DP-synthetic data. The private synthetic datasets are generated from real-world data, including a prostate cancer dataset (n = 500) and a cardiovascular dataset (n = 70,000), as well as on bivariate and multivariate simulated data. Five different DP-synthetic data generation methods are evaluated, including two basic DP histogram release methods and MWEM, Private-PGM, and DP GAN algorithms.

Conclusion: A large portion of the evaluation results expressed dramatically inflated Type I errors, especially at levels of ϵ ≤ 1. This result calls for caution when releasing and analyzing DP-synthetic data: low p-values may be obtained in statistical tests simply as a byproduct of the noise added to protect privacy. A DP Smoothed Histogram-based synthetic data generation method was shown to produce valid Type I error for all privacy levels tested but required a large original dataset size and a modest privacy budget (ϵ ≥ 5) in order to have reasonable Type II error levels.

查看原文本刊更多论文

差异化私有合成数据会带来合成发现吗？

背景：合成数据是共享敏感生物医学数据集匿名版本的一种解决方案。理想情况下，合成数据应保留原始数据的结构和统计特性，同时保护受试者的个人隐私。目前，差异隐私（DP）被认为是平衡这种权衡的黄金标准方法：本研究的目的是调查通过 DP 合成数据的独立样本测试发现的群体差异的可信度。评估从测试的 I 类和 II 类误差的角度进行。通过前者，我们可以量化检验的有效性，即错误发现的概率是否确实低于显著性水平：我们对 DP 合成数据进行了曼惠尼 U 检验、学生 t 检验、卡方检验和中位检验。私人合成数据集由真实世界数据生成，包括前列腺癌数据集（n=500）和心血管数据集（n=70 000），以及双变量和多变量模拟数据。评估了五种不同的 DP 合成数据生成方法，包括两种基本的 DP 直方图释放方法以及 MWEM、Private-PGM 和 DP GAN 算法：结论：大部分评估结果表明 I 类误差急剧扩大，尤其是在ϵ≤1 的水平上。这一结果要求在发布和分析 DP 合成数据时保持谨慎：在统计测试中可能会获得较低的 p 值，而这仅仅是为保护隐私而添加的噪声的副产品。基于 DP 平滑直方图的合成数据生成方法在所有测试的隐私级别中都能产生有效的 I 类误差，但需要较大的原始数据集规模和适度的隐私预算（ϵ≥ 5），以获得合理的 II 类误差水平。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

Methods of Information in Medicine 医学-计算机：信息系统

CiteScore

3.70

自引率

11.80%

发文量

审稿时长

6-12 weeks

期刊介绍： Good medicine and good healthcare demand good information. Since the journal''s founding in 1962, Methods of Information in Medicine has stressed the methodology and scientific fundamentals of organizing, representing and analyzing data, information and knowledge in biomedicine and health care. Covering publications in the fields of biomedical and health informatics, medical biometry, and epidemiology, the journal publishes original papers, reviews, reports, opinion papers, editorials, and letters to the editor. From time to time, the journal publishes articles on particular focus themes as part of a journal''s issue.