{"title":"Towards using differentially private synthetic data for machine learning in collaborative data science projects","authors":"Michael Holmes, George Theodorakopoulos","doi":"10.1145/3407023.3407024","DOIUrl":null,"url":null,"abstract":"As organisations increasingly embrace data science to extract additional value from the data they hold, understanding how ethical and secure data sharing practices effect the utility of models is necessary. For organisations taking first steps towards data science applications, collaborations may involve third parties which intend to design and train models for the data owner to use. However, the disclosure of bulk data sets presents risks in terms of privacy and security. In this work the authors compare classification accuracy of models trained on private data, synthetic data and differentially private synthetic data when tested on a private data hold-out set. The study explores whether models designed and trained using synthetic data can be applied back in to real-world private data environments without redesign or retraining. The study finds that for 33 classification problems, tested using private hold-out data, the accuracy of models trained using synthetic data without privacy diverge by 7%, with standard deviation of 0.06, from models trained and tested with the private data. Models trained with differential privacy diverge by between 8% and 14%, with standard deviation between 0.06 and 0.12. The results suggest that models trained on synthetic data do suffer loss in accuracy, but that performance divergence is fairly uniform across tasks and that divergence between models trained on data produced by private and non-private generators can be minimised.","PeriodicalId":121225,"journal":{"name":"Proceedings of the 15th International Conference on Availability, Reliability and Security","volume":"46 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-08-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 15th International Conference on Availability, Reliability and Security","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3407023.3407024","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 1
Abstract
As organisations increasingly embrace data science to extract additional value from the data they hold, it is necessary to understand how ethical and secure data-sharing practices affect the utility of models. For organisations taking their first steps towards data science applications, collaborations may involve third parties that design and train models for the data owner to use. However, disclosing bulk data sets presents privacy and security risks. In this work the authors compare the classification accuracy of models trained on private data, synthetic data and differentially private synthetic data, all tested on a hold-out set drawn from the private data. The study explores whether models designed and trained on synthetic data can be applied back to real-world private data environments without redesign or retraining. For 33 classification problems, evaluated on private hold-out data, the accuracy of models trained on synthetic data without privacy diverges by 7% on average, with a standard deviation of 0.06, from that of models trained and tested on the private data. Models trained with differential privacy diverge by between 8% and 14%, with standard deviations between 0.06 and 0.12. The results suggest that models trained on synthetic data do suffer a loss in accuracy, but that the performance divergence is fairly uniform across tasks, and that the divergence between models trained on data produced by private and non-private generators can be minimised.
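The evaluation protocol described above (train the same classifier on private, synthetic, and differentially private synthetic data, then score all three on a hold-out set drawn from the private data) can be illustrated with a minimal sketch. The code below is not the authors' implementation: the `toy_synthesiser` function is a crude per-class Gaussian sampler used purely as a stand-in, and its `noise_scale` parameter only mimics the utility cost of a privacy-preserving generator; a real experiment would substitute a proper (DP) generative model and a real private data set.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def toy_synthesiser(X, y, rng, noise_scale=0.0):
    """Sample synthetic records from per-class Gaussians fitted to X.

    noise_scale=0.0 plays the role of a non-private generator; a positive
    noise_scale perturbs the fitted means to mimic the utility cost of a
    private generator. This is NOT a calibrated differential-privacy
    mechanism, only an illustrative placeholder.
    """
    X_syn, y_syn = [], []
    for label in np.unique(y):
        Xc = X[y == label]
        mean = Xc.mean(axis=0) + rng.normal(0.0, noise_scale, Xc.shape[1])
        std = Xc.std(axis=0) + 1e-6
        X_syn.append(rng.normal(mean, std, size=Xc.shape))
        y_syn.append(np.full(len(Xc), label))
    return np.vstack(X_syn), np.concatenate(y_syn)


def holdout_accuracy(X_train, y_train, X_hold, y_hold):
    """Train a classifier and return its accuracy on the private hold-out set."""
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    return accuracy_score(y_hold, clf.predict(X_hold))


rng = np.random.default_rng(0)

# Stand-in for a private data set; the hold-out split is never shown to a generator.
X, y = make_classification(n_samples=2000, n_features=20, n_informative=10, random_state=0)
X_priv, X_hold, y_priv, y_hold = train_test_split(X, y, test_size=0.3, random_state=0)

# Generate "synthetic" and "DP-like synthetic" training data from the private split.
X_syn, y_syn = toy_synthesiser(X_priv, y_priv, rng, noise_scale=0.0)
X_dp, y_dp = toy_synthesiser(X_priv, y_priv, rng, noise_scale=0.5)

# All three models are evaluated on the same private hold-out set.
acc_priv = holdout_accuracy(X_priv, y_priv, X_hold, y_hold)
acc_syn = holdout_accuracy(X_syn, y_syn, X_hold, y_hold)
acc_dp = holdout_accuracy(X_dp, y_dp, X_hold, y_hold)

# Divergence from the private baseline, the quantity reported per task in the paper.
print(f"private baseline accuracy: {acc_priv:.3f}")
print(f"synthetic divergence:      {abs(acc_priv - acc_syn):.3f}")
print(f"'DP-like' divergence:      {abs(acc_priv - acc_dp):.3f}")
```

In the paper this comparison is repeated over 33 classification tasks, and the divergence figures quoted in the abstract are summaries of those per-task gaps rather than the output of a single run like the one sketched here.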