Towards using differentially private synthetic data for machine learning in collaborative data science projects

Michael Holmes, George Theodorakopoulos
{"title":"在协作数据科学项目中使用不同的私有合成数据进行机器学习","authors":"Michael Holmes, George Theodorakopoulos","doi":"10.1145/3407023.3407024","DOIUrl":null,"url":null,"abstract":"As organisations increasingly embrace data science to extract additional value from the data they hold, understanding how ethical and secure data sharing practices effect the utility of models is necessary. For organisations taking first steps towards data science applications, collaborations may involve third parties which intend to design and train models for the data owner to use. However, the disclosure of bulk data sets presents risks in terms of privacy and security. In this work the authors compare classification accuracy of models trained on private data, synthetic data and differentially private synthetic data when tested on a private data hold-out set. The study explores whether models designed and trained using synthetic data can be applied back in to real-world private data environments without redesign or retraining. The study finds that for 33 classification problems, tested using private hold-out data, the accuracy of models trained using synthetic data without privacy diverge by 7%, with standard deviation of 0.06, from models trained and tested with the private data. Models trained with differential privacy diverge by between 8% and 14%, with standard deviation between 0.06 and 0.12. The results suggest that models trained on synthetic data do suffer loss in accuracy, but that performance divergence is fairly uniform across tasks and that divergence between models trained on data produced by private and non-private generators can be minimised.","PeriodicalId":121225,"journal":{"name":"Proceedings of the 15th International Conference on Availability, Reliability and Security","volume":"46 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-08-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"Towards using differentially private synthetic data for machine learning in collaborative data science projects\",\"authors\":\"Michael Holmes, George Theodorakopoulos\",\"doi\":\"10.1145/3407023.3407024\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"As organisations increasingly embrace data science to extract additional value from the data they hold, understanding how ethical and secure data sharing practices effect the utility of models is necessary. For organisations taking first steps towards data science applications, collaborations may involve third parties which intend to design and train models for the data owner to use. However, the disclosure of bulk data sets presents risks in terms of privacy and security. In this work the authors compare classification accuracy of models trained on private data, synthetic data and differentially private synthetic data when tested on a private data hold-out set. The study explores whether models designed and trained using synthetic data can be applied back in to real-world private data environments without redesign or retraining. The study finds that for 33 classification problems, tested using private hold-out data, the accuracy of models trained using synthetic data without privacy diverge by 7%, with standard deviation of 0.06, from models trained and tested with the private data. Models trained with differential privacy diverge by between 8% and 14%, with standard deviation between 0.06 and 0.12. 
The results suggest that models trained on synthetic data do suffer loss in accuracy, but that performance divergence is fairly uniform across tasks and that divergence between models trained on data produced by private and non-private generators can be minimised.\",\"PeriodicalId\":121225,\"journal\":{\"name\":\"Proceedings of the 15th International Conference on Availability, Reliability and Security\",\"volume\":\"46 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2020-08-25\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 15th International Conference on Availability, Reliability and Security\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3407023.3407024\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 15th International Conference on Availability, Reliability and Security","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3407023.3407024","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 1

Abstract

As organisations increasingly embrace data science to extract additional value from the data they hold, it is necessary to understand how ethical and secure data-sharing practices affect the utility of models. For organisations taking their first steps towards data science applications, collaborations may involve third parties that intend to design and train models for the data owner to use. However, the disclosure of bulk data sets presents privacy and security risks. In this work the authors compare the classification accuracy of models trained on private data, synthetic data and differentially private synthetic data when tested on a private hold-out set. The study explores whether models designed and trained on synthetic data can be applied back into real-world private data environments without redesign or retraining. The study finds that, across 33 classification problems tested on private hold-out data, the accuracy of models trained on synthetic data without privacy diverges by 7%, with a standard deviation of 0.06, from that of models trained and tested on the private data. Models trained with differential privacy diverge by between 8% and 14%, with standard deviations between 0.06 and 0.12. The results suggest that models trained on synthetic data do suffer a loss in accuracy, but that the performance divergence is fairly uniform across tasks, and that the divergence between models trained on data produced by private and non-private generators can be minimised.
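The protocol in the abstract can be made concrete with a short sketch. The following Python example is not the authors' code: the abstract does not specify the generators, models or the 33 tasks, so a toy marginal-histogram synthesizer (with Laplace noise for the differentially private variant), a RandomForestClassifier and a make_classification dataset stand in for them. The sketch trains one model per data source and reports each model's accuracy divergence on the private hold-out set.

```python
# Minimal sketch of the comparison protocol described in the abstract,
# using scikit-learn and a toy synthesizer as illustrative stand-ins.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Stand-in "private" data, with a hold-out set that is never shared.
X, y = make_classification(n_samples=4000, n_features=10,
                           n_informative=6, random_state=0)
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.25, random_state=0)


def marginal_synthesizer(X, y, n_samples, epsilon=None, bins=10):
    """Sample synthetic records from per-class, per-feature marginal
    histograms. With epsilon set, Laplace noise of scale
    (n_features + 1) / epsilon is added to every count, so the released
    histograms satisfy epsilon-DP under basic composition (each record
    touches one class count and one bin per feature). Feature
    correlations are not preserved and the bin edges are data-dependent
    (a strict DP implementation would fix them a priori); this is an
    illustrative toy, not the paper's generator."""
    n_feat = X.shape[1]
    scale = None if epsilon is None else (n_feat + 1) / epsilon

    classes, class_counts = np.unique(y, return_counts=True)
    counts = class_counts.astype(float)
    if scale is not None:
        counts += rng.laplace(0.0, scale, size=len(classes))
    p_class = np.clip(counts, 1e-9, None)
    p_class /= p_class.sum()

    y_syn = rng.choice(classes, size=n_samples, p=p_class)
    X_parts, y_parts = [], []
    for c in classes:
        rows = X[y == c]
        m = int((y_syn == c).sum())
        cols = []
        for j in range(n_feat):
            hist, edges = np.histogram(rows[:, j], bins=bins)
            hist = hist.astype(float)
            if scale is not None:
                hist += rng.laplace(0.0, scale, size=bins)
            p = np.clip(hist, 1e-9, None)
            p /= p.sum()
            b = rng.choice(bins, size=m, p=p)
            # Draw uniformly within each sampled bin.
            cols.append(rng.uniform(edges[b], edges[b + 1]))
        X_parts.append(np.column_stack(cols))
        y_parts.append(np.full(m, c))
    return np.vstack(X_parts), np.concatenate(y_parts)


def holdout_accuracy(X_tr, y_tr):
    """Train on the given data, test on the private hold-out set."""
    clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
    return accuracy_score(y_hold, clf.predict(X_hold))


acc_private = holdout_accuracy(X_train, y_train)
X_syn, y_syn = marginal_synthesizer(X_train, y_train, len(y_train))
X_dp, y_dp = marginal_synthesizer(X_train, y_train, len(y_train),
                                  epsilon=1.0)

# "Divergence" as in the abstract: the accuracy gap from the model
# trained directly on the private data.
print(f"private baseline: accuracy {acc_private:.3f}")
for name, (Xt, yt) in {"synthetic (no privacy)": (X_syn, y_syn),
                       "DP synthetic (eps=1.0)": (X_dp, y_dp)}.items():
    acc = holdout_accuracy(Xt, yt)
    print(f"{name}: accuracy {acc:.3f}, "
          f"divergence {acc_private - acc:+.3f}")
```

Under this protocol the hold-out set stays with the data owner, mirroring the collaborative setting the paper describes: the third party sees only the synthetic data, yet the resulting model is evaluated, and ultimately deployed, on the private data.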