{"title":"A One-Pass Distributed and Private Sketch for Kernel Sums with Applications to Machine Learning at Scale","authors":"Benjamin Coleman, Anshumali Shrivastava","doi":"10.1145/3460120.3485255","DOIUrl":null,"url":null,"abstract":"Differential privacy is a compelling privacy definition that explains the privacy-utility tradeoff via formal, provable guarantees. In machine learning, we often wish to release a function over a dataset while preserving differential privacy. Although there are general algorithms to solve this problem for any function, such methods can require hours to days to run on moderately sized datasets. As a result, most private algorithms address task-dependent functions for specific applications. In this work, we propose a general purpose private sketch, or small summary of the dataset, that supports machine learning tasks such as regression, classification, density estimation, and more. Our sketch is ideal for large-scale distributed settings because it is simple to implement, mergeable, and can be created with a one-pass streaming algorithm. At the heart of our proposal is the reduction of many machine learning objectives to kernel sums. Our sketch estimates these sums using randomized contingency tables that are indexed with locality-sensitive hashing. Existing alternatives for kernel sum estimation scale poorly, often exponentially slower with an increase in dimensions. In contrast, our sketch can quickly run on large high-dimensional datasets, such as the 65 million node Friendster graph, in a single pass that takes less than 20 minutes, which is otherwise infeasible with any known alternative. Exhaustive experiments show that the privacy-utility tradeoff of our method is competitive with existing algorithms, but at an order-of-magnitude smaller computational cost. We expect that our sketch will be practically useful for differential privacy in distributed, large-scale machine learning settings.","PeriodicalId":135883,"journal":{"name":"Proceedings of the 2021 ACM SIGSAC Conference on Computer and Communications Security","volume":"2 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2021 ACM SIGSAC Conference on Computer and Communications Security","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3460120.3485255","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Differential privacy is a compelling privacy definition that explains the privacy-utility tradeoff via formal, provable guarantees. In machine learning, we often wish to release a function over a dataset while preserving differential privacy. Although there are general algorithms to solve this problem for any function, such methods can require hours to days to run on moderately sized datasets. As a result, most private algorithms address task-dependent functions for specific applications. In this work, we propose a general-purpose private sketch, or small summary of the dataset, that supports machine learning tasks such as regression, classification, density estimation, and more. Our sketch is ideal for large-scale distributed settings because it is simple to implement, mergeable, and constructible with a one-pass streaming algorithm. At the heart of our proposal is the reduction of many machine learning objectives to kernel sums. Our sketch estimates these sums using randomized contingency tables that are indexed with locality-sensitive hashing. Existing alternatives for kernel sum estimation scale poorly, often becoming exponentially slower as the dimension increases. In contrast, our sketch runs quickly on large high-dimensional datasets, such as the 65-million-node Friendster graph, in a single pass that takes less than 20 minutes, which is infeasible with any known alternative. Exhaustive experiments show that the privacy-utility tradeoff of our method is competitive with existing algorithms, but at an order-of-magnitude smaller computational cost. We expect that our sketch will be practically useful for differential privacy in distributed, large-scale machine learning settings.
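To make the construction concrete, below is a minimal, hypothetical sketch of the idea the abstract describes: an array of randomized counters indexed by locality-sensitive hashing, built in one streaming pass, merged by elementwise addition, and made differentially private with Laplace noise on the counters. The signed-random-projection (SimHash) hash family, the class name LSHKernelSketch, and all parameter choices are illustrative assumptions, not the paper's exact construction or its published privacy analysis.

```python
# Illustrative sketch only: LSH-indexed counters that estimate kernel sums
# (where the kernel is the LSH collision probability), with mergeability
# and a simple Laplace-noise privatization step. Assumed construction,
# not the paper's exact algorithm.
import numpy as np


class LSHKernelSketch:
    def __init__(self, dim, num_tables=50, bits_per_table=4, seed=0):
        rng = np.random.default_rng(seed)
        self.num_tables = num_tables
        self.bits = bits_per_table
        self.width = 2 ** bits_per_table
        # One signed-random-projection (SimHash) function per table.
        self.projections = rng.standard_normal((num_tables, bits_per_table, dim))
        # num_tables independent rows of counters ("contingency tables").
        self.counts = np.zeros((num_tables, self.width))

    def _index(self, x):
        # Hash x to one bucket per table via the sign pattern of its projections.
        bits = (np.einsum("tbd,d->tb", self.projections, x) >= 0).astype(int)
        return bits.dot(1 << np.arange(self.bits))

    def add(self, x):
        # One-pass streaming update: increment one counter per table.
        self.counts[np.arange(self.num_tables), self._index(x)] += 1

    def query(self, q):
        # The mean of the hit counters estimates the kernel sum sum_i k(q, x_i),
        # where k(q, x) is the collision probability of the LSH family.
        return self.counts[np.arange(self.num_tables), self._index(q)].mean()

    def merge(self, other):
        # Sketches built on disjoint partitions merge by elementwise addition.
        self.counts += other.counts

    def privatize(self, epsilon):
        # Each point touches one counter per table, so the L1 sensitivity of
        # the counter array is num_tables; Laplace noise at this scale gives
        # epsilon-DP for the released counters (a coarse, illustrative analysis).
        self.counts += np.random.default_rng().laplace(
            scale=self.num_tables / epsilon, size=self.counts.shape
        )


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    data = rng.standard_normal((10_000, 32))
    sketch = LSHKernelSketch(dim=32)
    for x in data:
        sketch.add(x)
    sketch.privatize(epsilon=1.0)
    print("estimated kernel sum at a query:", sketch.query(rng.standard_normal(32)))
```

In this toy version, each data point increments exactly one counter per table, so sketches built on disjoint shards can simply be summed before a single round of noise is added, which is what makes the mergeable, one-pass, distributed setting described in the abstract plausible.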