Tanveer Khan, Mindaugas Budzys, Khoa Nguyen, A. Michalas
Title: SoK: Wildest Dreams: Reproducible Research in Privacy-preserving Neural Network Training
DOI: 10.56553/popets-2024-0072
Journal: Proceedings on Privacy Enhancing Technologies, vol. 16, no. 9
Publication date: 2024-07-01
Citations: 0
Abstract
Machine Learning (ML) addresses a multitude of complex issues in multiple disciplines, including social sciences, finance, and medical research. ML models require substantial computing power and are only as powerful as the data utilized. Due to the high computational cost of ML methods, data scientists frequently use Machine Learning-as-a-Service (MLaaS) to outsource computation to external servers. However, when working with private information, like financial data or health records, outsourcing the computation might result in privacy issues. Recent advances in Privacy-Preserving Techniques (PPTs) have enabled ML training and inference over protected data through the use of Privacy-Preserving Machine Learning (PPML). However, these techniques are still at a preliminary stage, and applying them in real-world situations is demanding. In order to comprehend the discrepancy between theoretical research suggestions and actual applications, this work examines the past and present of PPML, focusing on Homomorphic Encryption (HE) and Secure Multi-party Computation (SMPC) applied to ML. This work primarily focuses on the ML model's training phase, where maintaining user data privacy is of utmost importance. We provide a solid theoretical background that eases the understanding of current approaches and their limitations. We also provide some preliminaries of SMPC, HE, and ML. In addition, we present a systemization of knowledge of the most recent PPML frameworks for model training and provide a comprehensive comparison of their unique properties and performance on standard benchmarks. Also, we reproduce the results for some of the surveyed papers and examine to what extent existing works in the field provide support for open science.
We believe our work serves as a valuable contribution by raising awareness about the current gap between theoretical advancements and real-world applications in PPML, specifically regarding open-source availability, reproducibility, and usability.
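The SMPC frameworks the abstract refers to are typically built on secret-sharing primitives: each party splits its private input into random shares, and computation proceeds on shares so that no single party ever sees another's data in the clear. As a minimal illustration of the idea (not code from the paper; the field modulus and party count are arbitrary choices), additive secret sharing over a prime field can be sketched as:

```python
import random

# Illustrative prime modulus; shares live in the field Z_P.
P = 2**61 - 1

def share(secret, n_parties):
    """Split `secret` into n_parties additive shares that sum to it mod P."""
    shares = [random.randrange(P) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % P)
    return shares

def reconstruct(shares):
    """Recombine shares; only the full set reveals the secret."""
    return sum(shares) % P

# Two private inputs, each split among 3 parties.
x_shares = share(42, 3)
y_shares = share(100, 3)

# Each party locally adds its own shares of x and y -- addition of
# secrets requires no communication and exposes no plaintext.
z_shares = [(xs + ys) % P for xs, ys in zip(x_shares, y_shares)]

print(reconstruct(z_shares))  # 142
```

Any proper subset of shares is uniformly random and reveals nothing about the secret; this additive-homomorphism property is what lets SMPC protocols evaluate the linear layers of a neural network on shared data, while non-linear operations require interactive sub-protocols.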