Towards using differentially private synthetic data for machine learning in collaborative data science projects

Michael Holmes, George Theodorakopoulos
{"title":"在协作数据科学项目中使用不同的私有合成数据进行机器学习","authors":"Michael Holmes, George Theodorakopoulos","doi":"10.1145/3407023.3407024","DOIUrl":null,"url":null,"abstract":"As organisations increasingly embrace data science to extract additional value from the data they hold, understanding how ethical and secure data sharing practices effect the utility of models is necessary. For organisations taking first steps towards data science applications, collaborations may involve third parties which intend to design and train models for the data owner to use. However, the disclosure of bulk data sets presents risks in terms of privacy and security. In this work the authors compare classification accuracy of models trained on private data, synthetic data and differentially private synthetic data when tested on a private data hold-out set. The study explores whether models designed and trained using synthetic data can be applied back in to real-world private data environments without redesign or retraining. The study finds that for 33 classification problems, tested using private hold-out data, the accuracy of models trained using synthetic data without privacy diverge by 7%, with standard deviation of 0.06, from models trained and tested with the private data. Models trained with differential privacy diverge by between 8% and 14%, with standard deviation between 0.06 and 0.12. The results suggest that models trained on synthetic data do suffer loss in accuracy, but that performance divergence is fairly uniform across tasks and that divergence between models trained on data produced by private and non-private generators can be minimised.","PeriodicalId":121225,"journal":{"name":"Proceedings of the 15th International Conference on Availability, Reliability and Security","volume":"46 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-08-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"Towards using differentially private synthetic data for machine learning in collaborative data science projects\",\"authors\":\"Michael Holmes, George Theodorakopoulos\",\"doi\":\"10.1145/3407023.3407024\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"As organisations increasingly embrace data science to extract additional value from the data they hold, understanding how ethical and secure data sharing practices effect the utility of models is necessary. For organisations taking first steps towards data science applications, collaborations may involve third parties which intend to design and train models for the data owner to use. However, the disclosure of bulk data sets presents risks in terms of privacy and security. In this work the authors compare classification accuracy of models trained on private data, synthetic data and differentially private synthetic data when tested on a private data hold-out set. The study explores whether models designed and trained using synthetic data can be applied back in to real-world private data environments without redesign or retraining. The study finds that for 33 classification problems, tested using private hold-out data, the accuracy of models trained using synthetic data without privacy diverge by 7%, with standard deviation of 0.06, from models trained and tested with the private data. Models trained with differential privacy diverge by between 8% and 14%, with standard deviation between 0.06 and 0.12. 
The results suggest that models trained on synthetic data do suffer loss in accuracy, but that performance divergence is fairly uniform across tasks and that divergence between models trained on data produced by private and non-private generators can be minimised.\",\"PeriodicalId\":121225,\"journal\":{\"name\":\"Proceedings of the 15th International Conference on Availability, Reliability and Security\",\"volume\":\"46 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2020-08-25\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 15th International Conference on Availability, Reliability and Security\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3407023.3407024\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 15th International Conference on Availability, Reliability and Security","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3407023.3407024","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 1

Abstract

As organisations increasingly embrace data science to extract additional value from the data they hold, it is necessary to understand how ethical and secure data-sharing practices affect the utility of models. For organisations taking their first steps towards data science applications, collaborations may involve third parties that intend to design and train models for the data owner to use. However, the disclosure of bulk data sets presents privacy and security risks. In this work the authors compare the classification accuracy of models trained on private data, synthetic data and differentially private synthetic data when tested on a private hold-out set. The study explores whether models designed and trained on synthetic data can be applied back into real-world private data environments without redesign or retraining. The study finds that, across 33 classification problems tested on private hold-out data, the accuracy of models trained on synthetic data without privacy diverges by 7%, with a standard deviation of 0.06, from that of models trained and tested on the private data. Models trained with differential privacy diverge by between 8% and 14%, with standard deviations between 0.06 and 0.12. The results suggest that models trained on synthetic data do suffer a loss in accuracy, but that the performance divergence is fairly uniform across tasks, and that the divergence between models trained on data produced by private and non-private generators can be minimised.
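The protocol in the abstract can be made concrete with a short sketch. The following Python example is not the authors' code: the abstract does not specify the generators, models or the 33 tasks, so a toy marginal-histogram synthesizer (with Laplace noise for the differentially private variant), a RandomForestClassifier and a make_classification dataset stand in for them. The sketch trains one model per data source and reports each model's accuracy divergence on the private hold-out set.

```python
# Minimal sketch of the comparison protocol described in the abstract,
# using scikit-learn and a toy synthesizer as illustrative stand-ins.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Stand-in "private" data, with a hold-out set that is never shared.
X, y = make_classification(n_samples=4000, n_features=10,
                           n_informative=6, random_state=0)
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.25, random_state=0)


def marginal_synthesizer(X, y, n_samples, epsilon=None, bins=10):
    """Sample synthetic records from per-class, per-feature marginal
    histograms. With epsilon set, Laplace noise of scale
    (n_features + 1) / epsilon is added to every count, so the released
    histograms satisfy epsilon-DP under basic composition (each record
    touches one class count and one bin per feature). Feature
    correlations are not preserved and the bin edges are data-dependent
    (a strict DP implementation would fix them a priori); this is an
    illustrative toy, not the paper's generator."""
    n_feat = X.shape[1]
    scale = None if epsilon is None else (n_feat + 1) / epsilon

    classes, class_counts = np.unique(y, return_counts=True)
    counts = class_counts.astype(float)
    if scale is not None:
        counts += rng.laplace(0.0, scale, size=len(classes))
    p_class = np.clip(counts, 1e-9, None)
    p_class /= p_class.sum()

    y_syn = rng.choice(classes, size=n_samples, p=p_class)
    X_parts, y_parts = [], []
    for c in classes:
        rows = X[y == c]
        m = int((y_syn == c).sum())
        cols = []
        for j in range(n_feat):
            hist, edges = np.histogram(rows[:, j], bins=bins)
            hist = hist.astype(float)
            if scale is not None:
                hist += rng.laplace(0.0, scale, size=bins)
            p = np.clip(hist, 1e-9, None)
            p /= p.sum()
            b = rng.choice(bins, size=m, p=p)
            # Draw uniformly within each sampled bin.
            cols.append(rng.uniform(edges[b], edges[b + 1]))
        X_parts.append(np.column_stack(cols))
        y_parts.append(np.full(m, c))
    return np.vstack(X_parts), np.concatenate(y_parts)


def holdout_accuracy(X_tr, y_tr):
    """Train on the given data, test on the private hold-out set."""
    clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
    return accuracy_score(y_hold, clf.predict(X_hold))


acc_private = holdout_accuracy(X_train, y_train)
X_syn, y_syn = marginal_synthesizer(X_train, y_train, len(y_train))
X_dp, y_dp = marginal_synthesizer(X_train, y_train, len(y_train),
                                  epsilon=1.0)

# "Divergence" as in the abstract: the accuracy gap from the model
# trained directly on the private data.
print(f"private baseline: accuracy {acc_private:.3f}")
for name, (Xt, yt) in {"synthetic (no privacy)": (X_syn, y_syn),
                       "DP synthetic (eps=1.0)": (X_dp, y_dp)}.items():
    acc = holdout_accuracy(Xt, yt)
    print(f"{name}: accuracy {acc:.3f}, "
          f"divergence {acc_private - acc:+.3f}")
```

Under this protocol the hold-out set stays with the data owner, mirroring the collaborative setting the paper describes: the third party sees only the synthetic data, yet the resulting model is evaluated, and ultimately deployed, on the private data.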