Epistemic Parity: Reproducibility as an Evaluation Metric for Differential Privacy

ACM SIGMOD Record. Pub Date: 2024-05-14. DOI: 10.1145/3665252.3665267

Lucas Rosenblatt, Bernease Herman, Anastasia Holovenko, Wonkwon Lee, Joshua Loftus, Elizabeth McKinnie, Taras Rumezhak, Andrii Stadnik, Bill Howe, Julia Stoyanovich



Abstract

Differentially private (DP) data synthesizers are increasingly proposed as a way to release sensitive information publicly. They offer theoretical guarantees for privacy (and, in some cases, utility), but limited empirical evidence of utility in practical settings. Utility is typically measured as the error on representative proxy tasks, such as descriptive statistics, multivariate correlations, the accuracy of trained classifiers, or performance over a query workload. Whether these results generalize to practitioners' experience has been questioned in a number of settings, including the U.S. Census. In this paper, we propose an evaluation methodology for synthetic data that avoids assumptions about the representativeness of proxy tasks, instead measuring the likelihood that published conclusions would change had the authors used synthetic data, a condition we call epistemic parity. Our methodology consists of reproducing empirical conclusions of peer-reviewed papers on real, publicly available data, then re-running these experiments on DP synthetic data and comparing the results.
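The comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names (`toy_synthesize`, `finding_holds`, `parity_rate`), the example "finding", and the data are all hypothetical. The synthesizer is a toy noisy-histogram stand-in in the style of a Laplace mechanism, used only to produce perturbed data, and it makes no claim of a formal privacy guarantee. The sketch estimates how often a conclusion drawn from the real data is also drawn from synthetic data.

```python
import random


def laplace(scale, rng):
    # The difference of two i.i.d. exponentials is Laplace-distributed.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def toy_synthesize(rows, epsilon, rng):
    """Hypothetical stand-in synthesizer (assumption, not from the paper):
    add Laplace noise to (group, outcome) cell counts, then resample."""
    cells = [(g, y) for g in (0, 1) for y in (0, 1)]
    counts = {c: 0 for c in cells}
    for r in rows:
        counts[r] += 1
    # Clip negative noisy counts to zero so they can serve as sampling weights.
    noisy = [max(counts[c] + laplace(1.0 / epsilon, rng), 0.0) for c in cells]
    if sum(noisy) == 0:
        noisy = [1.0] * len(cells)
    return rng.choices(cells, weights=noisy, k=len(rows))


def finding_holds(rows):
    """Hypothetical published finding: group 1 has a higher outcome rate
    than group 0."""
    def rate(group):
        ys = [y for (g, y) in rows if g == group]
        return sum(ys) / len(ys) if ys else 0.0
    return rate(1) > rate(0)


def parity_rate(real_rows, epsilon, trials=200, seed=0):
    """Fraction of synthetic datasets on which the finding's conclusion
    matches the conclusion reached on the real data."""
    rng = random.Random(seed)
    baseline = finding_holds(real_rows)
    agree = sum(
        finding_holds(toy_synthesize(real_rows, epsilon, rng)) == baseline
        for _ in range(trials)
    )
    return agree / trials


# Illustrative data: outcome rate 0.30 in group 0 and 0.60 in group 1.
real_rows = [(0, 1)] * 30 + [(0, 0)] * 70 + [(1, 1)] * 60 + [(1, 0)] * 40

for eps in (0.01, 0.1, 1.0):
    print(eps, parity_rate(real_rows, epsilon=eps))
```

In this toy setup a smaller `epsilon` injects more noise, so the agreement rate would be expected to fall as `epsilon` shrinks. A real evaluation would replace `finding_holds` with the statistical test behind each reproduced conclusion.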


