频率表的最优微分隐私过程的推理

Journal of data science : JDS Pub Date : 2022-01-01 DOI:10.6339/22-jds1044

Chengcheng Li, Na Wang, Gongjun Xu

{"title":"频率表的最优微分隐私过程的推理","authors":"Chengcheng Li, Na Wang, Gongjun Xu","doi":"10.6339/22-jds1044","DOIUrl":null,"url":null,"abstract":"When releasing data to the public, a vital concern is the risk of exposing personal information of the individuals who have contributed to the data set. Many mechanisms have been proposed to protect individual privacy, though less attention has been dedicated to practically conducting valid inferences on the altered privacy-protected data sets. For frequency tables, the privacy-protection-oriented perturbations often lead to negative cell counts. Releasing such tables can undermine users’ confidence in the usefulness of such data sets. This paper focuses on releasing one-way frequency tables. We recommend an optimal mechanism that satisfies ϵ-differential privacy (DP) without suffering from having negative cell counts. The procedure is optimal in the sense that the expected utility is maximized under a given privacy constraint. Valid inference procedures for testing goodness-of-fit are also developed for the DP privacy-protected data. In particular, we propose a de-biased test statistic for the optimal procedure and derive its asymptotic distribution. In addition, we also introduce testing procedures for the commonly used Laplace and Gaussian mechanisms, which provide a good finite sample approximation for the null distributions. Moreover, the decaying rate requirements for the privacy regime are provided for the inference procedures to be valid. We further consider common users’ practices such as merging related or neighboring cells or integrating statistical information obtained across different data sources and derive valid testing procedures when these operations occur. Simulation studies show that our inference results hold well even when the sample size is relatively small. Comparisons with the current field standards, including the Laplace, the Gaussian (both with/without post-processing of replacing negative cell counts with zeros), and the Binomial-Beta McClure-Reiter mechanisms, are carried out. In the end, we apply our method to the National Center for Early Development and Learning’s (NCEDL) multi-state studies data to demonstrate its practical applicability.","PeriodicalId":73699,"journal":{"name":"Journal of data science : JDS","volume":"1 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2022-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Inference for Optimal Differential Privacy Procedures for Frequency Tables\",\"authors\":\"Chengcheng Li, Na Wang, Gongjun Xu\",\"doi\":\"10.6339/22-jds1044\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"When releasing data to the public, a vital concern is the risk of exposing personal information of the individuals who have contributed to the data set. Many mechanisms have been proposed to protect individual privacy, though less attention has been dedicated to practically conducting valid inferences on the altered privacy-protected data sets. For frequency tables, the privacy-protection-oriented perturbations often lead to negative cell counts. Releasing such tables can undermine users’ confidence in the usefulness of such data sets. This paper focuses on releasing one-way frequency tables. We recommend an optimal mechanism that satisfies ϵ-differential privacy (DP) without suffering from having negative cell counts. The procedure is optimal in the sense that the expected utility is maximized under a given privacy constraint. Valid inference procedures for testing goodness-of-fit are also developed for the DP privacy-protected data. In particular, we propose a de-biased test statistic for the optimal procedure and derive its asymptotic distribution. In addition, we also introduce testing procedures for the commonly used Laplace and Gaussian mechanisms, which provide a good finite sample approximation for the null distributions. Moreover, the decaying rate requirements for the privacy regime are provided for the inference procedures to be valid. We further consider common users’ practices such as merging related or neighboring cells or integrating statistical information obtained across different data sources and derive valid testing procedures when these operations occur. Simulation studies show that our inference results hold well even when the sample size is relatively small. Comparisons with the current field standards, including the Laplace, the Gaussian (both with/without post-processing of replacing negative cell counts with zeros), and the Binomial-Beta McClure-Reiter mechanisms, are carried out. In the end, we apply our method to the National Center for Early Development and Learning’s (NCEDL) multi-state studies data to demonstrate its practical applicability.\",\"PeriodicalId\":73699,\"journal\":{\"name\":\"Journal of data science : JDS\",\"volume\":\"1 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-01-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Journal of data science : JDS\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.6339/22-jds1044\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of data science : JDS","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.6339/22-jds1044","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 0

摘要

在向公众发布数据时，一个至关重要的问题是有可能暴露为数据集做出贡献的个人的个人信息。已经提出了许多机制来保护个人隐私，尽管很少有人关注对改变的隐私保护数据集进行实际有效的推断。对于频率表，隐私保护导向的扰动通常导致负细胞计数。发布这样的表格会削弱用户对这些数据集有用性的信心。本文的重点是发布单向频率表。我们推荐一个最优的机制，满足ϵ-differential隐私(DP)，而不会遭受负细胞计数。这个过程是最优的，因为在给定的隐私约束下，预期效用是最大化的。对于DP隐私保护数据，还开发了测试拟合优度的有效推理程序。特别地，我们提出了最优过程的去偏检验统计量，并推导了它的渐近分布。此外，我们还介绍了常用的拉普拉斯和高斯机制的测试程序，它们为零分布提供了良好的有限样本近似。此外，还提供了隐私制度的衰减率要求，以使推理过程有效。我们进一步考虑常见的用户实践，例如合并相关或相邻的单元格，或集成跨不同数据源获得的统计信息，并在这些操作发生时推导出有效的测试过程。仿真研究表明，即使在样本量相对较小的情况下，我们的推断结果也保持良好。与当前的领域标准进行了比较，包括拉普拉斯，高斯(都有/没有用零替换负细胞计数的后处理)和二项式- β麦克卢尔-瑞特机制。最后，我们将我们的方法应用于国家早期发展和学习中心(NCEDL)的多州研究数据，以证明其实际适用性。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Inference for Optimal Differential Privacy Procedures for Frequency Tables

When releasing data to the public, a vital concern is the risk of exposing personal information of the individuals who have contributed to the data set. Many mechanisms have been proposed to protect individual privacy, though less attention has been dedicated to practically conducting valid inferences on the altered privacy-protected data sets. For frequency tables, the privacy-protection-oriented perturbations often lead to negative cell counts. Releasing such tables can undermine users’ confidence in the usefulness of such data sets. This paper focuses on releasing one-way frequency tables. We recommend an optimal mechanism that satisfies ϵ-differential privacy (DP) without suffering from having negative cell counts. The procedure is optimal in the sense that the expected utility is maximized under a given privacy constraint. Valid inference procedures for testing goodness-of-fit are also developed for the DP privacy-protected data. In particular, we propose a de-biased test statistic for the optimal procedure and derive its asymptotic distribution. In addition, we also introduce testing procedures for the commonly used Laplace and Gaussian mechanisms, which provide a good finite sample approximation for the null distributions. Moreover, the decaying rate requirements for the privacy regime are provided for the inference procedures to be valid. We further consider common users’ practices such as merging related or neighboring cells or integrating statistical information obtained across different data sources and derive valid testing procedures when these operations occur. Simulation studies show that our inference results hold well even when the sample size is relatively small. Comparisons with the current field standards, including the Laplace, the Gaussian (both with/without post-processing of replacing negative cell counts with zeros), and the Binomial-Beta McClure-Reiter mechanisms, are carried out. In the end, we apply our method to the National Center for Early Development and Learning’s (NCEDL) multi-state studies data to demonstrate its practical applicability.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Journal of data science : JDS

自引率

0.00%

发文量