Saibot: A Differentially Private Data Search Platform

Proc. VLDB Endow. Pub Date : 2023-07-01 DOI:10.48550/arXiv.2307.00432

Zezhou Huang, Jiaxiang Liu, Daniel Alabi, R. Fernandez, Eugene Wu

Pages: 3057-3070

Citations: 1

Abstract



Recent data search platforms use ML task-based utility measures, rather than metadata-based keywords, to search large dataset corpora. Requesters submit a training dataset, and these platforms search for augmentations (join- or union-compatible datasets) that, when used to augment the requester's dataset, most improve model (e.g., linear regression) performance. Although this approach is effective, providers that manage personally identifiable data demand differential privacy (DP) guarantees before granting these platforms data access. Unfortunately, making data search differentially private is nontrivial, as a single search can involve training and evaluating datasets hundreds or thousands of times, quickly depleting privacy budgets.

We present Saibot, a differentially private data search platform that employs the Factorized Privacy Mechanism (FPM), a novel DP mechanism, to calculate sufficient semi-ring statistics for ML over different combinations of datasets. These statistics are privatized once and can then be freely reused for the search. This allows Saibot to scale to arbitrary numbers of datasets and requests, while minimizing the effect of DP noise on search results. We optimize the sensitivity of FPM for common augmentation operations, and analyze its properties with respect to linear regression. Specifically, we develop an unbiased estimator for many-to-many joins, prove its bounds, and develop an optimization to redistribute DP noise to minimize the impact on the model. Our evaluation on a real-world dataset corpus of 329 datasets demonstrates that Saibot can return augmentations that achieve model accuracy within 50-90% of non-private search, while the leading alternative DP mechanisms (TPM, APM, shuffling) are several orders of magnitude worse.
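The idea of privatizing sufficient statistics once and reusing them can be illustrated with a minimal sketch for linear regression over a union of two datasets. This is not Saibot's FPM: the function names (`stats`, `privatize`, `union`, `fit`) and the noise scale `sigma` are hypothetical, the Gaussian noise is not calibrated to any particular privacy guarantee, and joins are not covered. The sketch only shows that, once each provider has published noisy statistics, candidate unions can be evaluated by adding those statistics, without touching the raw rows again.

```python
import numpy as np

def stats(X, y):
    """Sufficient statistics for least squares with an intercept: (Z^T Z, Z^T y)."""
    Z = np.column_stack([np.ones(len(X)), X])
    return Z.T @ Z, Z.T @ y

def privatize(s, sigma, rng):
    """Add Gaussian noise to the statistics once.

    sigma is a placeholder (assumption): a real mechanism would derive it
    from the sensitivity of the statistics and the privacy budget.
    """
    G, b = s
    E = rng.normal(0.0, sigma, size=G.shape)
    E = np.triu(E) + np.triu(E, 1).T  # keep the noisy Gram matrix symmetric
    return G + E, b + rng.normal(0.0, sigma, size=b.shape)

def union(s1, s2):
    """Statistics of the union of two datasets are the sums of their statistics."""
    return s1[0] + s2[0], s1[1] + s2[1]

def fit(s, ridge=1e-6):
    """Solve the normal equations from the statistics alone (small ridge for stability)."""
    G, b = s
    return np.linalg.solve(G + ridge * np.eye(G.shape[0]), b)

# Usage: each dataset is privatized once; the noisy statistics are then reused.
rng = np.random.default_rng(0)
w = np.array([2.0, -1.0])
X1 = rng.normal(size=(500, 2)); y1 = 0.5 + X1 @ w
X2 = rng.normal(size=(500, 2)); y2 = 0.5 + X2 @ w

p1 = privatize(stats(X1, y1), sigma=1.0, rng=rng)
p2 = privatize(stats(X2, y2), sigma=1.0, rng=rng)

print("requester only:", fit(p1))
print("with union    :", fit(union(p1, p2)))
```

Because `fit` reads only the published statistics, evaluating another candidate union costs one matrix addition and one small linear solve, and no further noise has to be added.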

