基于联邦生成自编码器的保密性高维数据采集

Proceedings on Privacy Enhancing Technologies. Privacy Enhancing Technologies Symposium Pub Date : 2021-11-20 DOI:10.2478/popets-2022-0024

Xue Jiang, Xuebing Zhou, Jens Grossklags

{"title":"基于联邦生成自编码器的保密性高维数据采集","authors":"Xue Jiang, Xuebing Zhou, Jens Grossklags","doi":"10.2478/popets-2022-0024","DOIUrl":null,"url":null,"abstract":"Abstract Business intelligence and AI services often involve the collection of copious amounts of multidimensional personal data. Since these data usually contain sensitive information of individuals, the direct collection can lead to privacy violations. Local differential privacy (LDP) is currently considered a state-ofthe-art solution for privacy-preserving data collection. However, existing LDP algorithms are not applicable to high-dimensional data; not only because of the increase in computation and communication cost, but also poor data utility. In this paper, we aim at addressing the curse-of-dimensionality problem in LDP-based high-dimensional data collection. Based on the idea of machine learning and data synthesis, we propose DP-Fed-Wae, an efficient privacy-preserving framework for collecting high-dimensional categorical data. With the combination of a generative autoencoder, federated learning, and differential privacy, our framework is capable of privately learning the statistical distributions of local data and generating high utility synthetic data on the server side without revealing users’ private information. We have evaluated the framework in terms of data utility and privacy protection on a number of real-world datasets containing 68–124 classification attributes. We show that our framework outperforms the LDP-based baseline algorithms in capturing joint distributions and correlations of attributes and generating high-utility synthetic data. With a local privacy guarantee ∈ = 8, the machine learning models trained with the synthetic data generated by the baseline algorithm cause an accuracy loss of 10% ~ 30%, whereas the accuracy loss is significantly reduced to less than 3% and at best even less than 1% with our framework. Extensive experimental results demonstrate the capability and efficiency of our framework in synthesizing high-dimensional data while striking a satisfactory utility-privacy balance.","PeriodicalId":74556,"journal":{"name":"Proceedings on Privacy Enhancing Technologies. Privacy Enhancing Technologies Symposium","volume":"2022 1","pages":"481 - 500"},"PeriodicalIF":0.0000,"publicationDate":"2021-11-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"9","resultStr":"{\"title\":\"Privacy-Preserving High-dimensional Data Collection with Federated Generative Autoencoder\",\"authors\":\"Xue Jiang, Xuebing Zhou, Jens Grossklags\",\"doi\":\"10.2478/popets-2022-0024\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Abstract Business intelligence and AI services often involve the collection of copious amounts of multidimensional personal data. Since these data usually contain sensitive information of individuals, the direct collection can lead to privacy violations. Local differential privacy (LDP) is currently considered a state-ofthe-art solution for privacy-preserving data collection. However, existing LDP algorithms are not applicable to high-dimensional data; not only because of the increase in computation and communication cost, but also poor data utility. In this paper, we aim at addressing the curse-of-dimensionality problem in LDP-based high-dimensional data collection. Based on the idea of machine learning and data synthesis, we propose DP-Fed-Wae, an efficient privacy-preserving framework for collecting high-dimensional categorical data. With the combination of a generative autoencoder, federated learning, and differential privacy, our framework is capable of privately learning the statistical distributions of local data and generating high utility synthetic data on the server side without revealing users’ private information. We have evaluated the framework in terms of data utility and privacy protection on a number of real-world datasets containing 68–124 classification attributes. We show that our framework outperforms the LDP-based baseline algorithms in capturing joint distributions and correlations of attributes and generating high-utility synthetic data. With a local privacy guarantee ∈ = 8, the machine learning models trained with the synthetic data generated by the baseline algorithm cause an accuracy loss of 10% ~ 30%, whereas the accuracy loss is significantly reduced to less than 3% and at best even less than 1% with our framework. Extensive experimental results demonstrate the capability and efficiency of our framework in synthesizing high-dimensional data while striking a satisfactory utility-privacy balance.\",\"PeriodicalId\":74556,\"journal\":{\"name\":\"Proceedings on Privacy Enhancing Technologies. Privacy Enhancing Technologies Symposium\",\"volume\":\"2022 1\",\"pages\":\"481 - 500\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2021-11-20\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"9\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings on Privacy Enhancing Technologies. Privacy Enhancing Technologies Symposium\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.2478/popets-2022-0024\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings on Privacy Enhancing Technologies. Privacy Enhancing Technologies Symposium","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.2478/popets-2022-0024","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 9

摘要

摘要商业智能和人工智能服务通常涉及收集大量多维个人数据。由于这些数据通常包含个人的敏感信息，直接收集可能会导致侵犯隐私。局部差分隐私（LDP）目前被认为是一种最先进的隐私保护数据收集解决方案。然而，现有的LDP算法不适用于高维数据；这不仅是因为计算和通信成本的增加，而且数据的实用性较差。本文旨在解决基于LDP的高维数据采集中的维数诅咒问题。基于机器学习和数据合成的思想，我们提出了一种高效的隐私保护框架DP-Fede-Wae，用于收集高维分类数据。通过将生成自动编码器、联合学习和差分隐私相结合，我们的框架能够私下学习本地数据的统计分布，并在服务器端生成高效用的合成数据，而不会泄露用户的私人信息。我们在包含68–124个分类属性的多个真实世界数据集上，从数据实用性和隐私保护方面对该框架进行了评估。我们表明，我们的框架在捕获属性的联合分布和相关性以及生成高效用合成数据方面优于基于LDP的基线算法。在局部隐私保证∈=8的情况下，使用基线算法生成的合成数据训练的机器学习模型会导致10%~30%的准确度损失，而使用我们的框架，准确度损失显著降低到3%以下，最多甚至低于1%。大量的实验结果证明了我们的框架在合成高维数据方面的能力和效率，同时达到了令人满意的效用-隐私平衡。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Privacy-Preserving High-dimensional Data Collection with Federated Generative Autoencoder

Abstract Business intelligence and AI services often involve the collection of copious amounts of multidimensional personal data. Since these data usually contain sensitive information of individuals, the direct collection can lead to privacy violations. Local differential privacy (LDP) is currently considered a state-ofthe-art solution for privacy-preserving data collection. However, existing LDP algorithms are not applicable to high-dimensional data; not only because of the increase in computation and communication cost, but also poor data utility. In this paper, we aim at addressing the curse-of-dimensionality problem in LDP-based high-dimensional data collection. Based on the idea of machine learning and data synthesis, we propose DP-Fed-Wae, an efficient privacy-preserving framework for collecting high-dimensional categorical data. With the combination of a generative autoencoder, federated learning, and differential privacy, our framework is capable of privately learning the statistical distributions of local data and generating high utility synthetic data on the server side without revealing users’ private information. We have evaluated the framework in terms of data utility and privacy protection on a number of real-world datasets containing 68–124 classification attributes. We show that our framework outperforms the LDP-based baseline algorithms in capturing joint distributions and correlations of attributes and generating high-utility synthetic data. With a local privacy guarantee ∈ = 8, the machine learning models trained with the synthetic data generated by the baseline algorithm cause an accuracy loss of 10% ~ 30%, whereas the accuracy loss is significantly reduced to less than 3% and at best even less than 1% with our framework. Extensive experimental results demonstrate the capability and efficiency of our framework in synthesizing high-dimensional data while striking a satisfactory utility-privacy balance.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Proceedings on Privacy Enhancing Technologies. Privacy Enhancing Technologies Symposium

自引率

0.00%

发文量

审稿时长

16 weeks