{"title":"A One-Pass Distributed and Private Sketch for Kernel Sums with Applications to Machine Learning at Scale","authors":"Benjamin Coleman, Anshumali Shrivastava","doi":"10.1145/3460120.3485255","DOIUrl":null,"url":null,"abstract":"Differential privacy is a compelling privacy definition that explains the privacy-utility tradeoff via formal, provable guarantees. In machine learning, we often wish to release a function over a dataset while preserving differential privacy. Although there are general algorithms to solve this problem for any function, such methods can require hours to days to run on moderately sized datasets. As a result, most private algorithms address task-dependent functions for specific applications. In this work, we propose a general purpose private sketch, or small summary of the dataset, that supports machine learning tasks such as regression, classification, density estimation, and more. Our sketch is ideal for large-scale distributed settings because it is simple to implement, mergeable, and can be created with a one-pass streaming algorithm. At the heart of our proposal is the reduction of many machine learning objectives to kernel sums. Our sketch estimates these sums using randomized contingency tables that are indexed with locality-sensitive hashing. Existing alternatives for kernel sum estimation scale poorly, often exponentially slower with an increase in dimensions. In contrast, our sketch can quickly run on large high-dimensional datasets, such as the 65 million node Friendster graph, in a single pass that takes less than 20 minutes, which is otherwise infeasible with any known alternative. Exhaustive experiments show that the privacy-utility tradeoff of our method is competitive with existing algorithms, but at an order-of-magnitude smaller computational cost. We expect that our sketch will be practically useful for differential privacy in distributed, large-scale machine learning settings.","PeriodicalId":135883,"journal":{"name":"Proceedings of the 2021 ACM SIGSAC Conference on Computer and Communications Security","volume":"2 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2021 ACM SIGSAC Conference on Computer and Communications Security","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3460120.3485255","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Differential privacy is a compelling privacy definition that explains the privacy-utility tradeoff via formal, provable guarantees. In machine learning, we often wish to release a function over a dataset while preserving differential privacy. Although there are general algorithms to solve this problem for any function, such methods can require hours to days to run on moderately sized datasets. As a result, most private algorithms address task-dependent functions for specific applications. In this work, we propose a general-purpose private sketch, or small summary of the dataset, that supports machine learning tasks such as regression, classification, density estimation, and more. Our sketch is ideal for large-scale distributed settings because it is simple to implement, mergeable, and constructible with a one-pass streaming algorithm. At the heart of our proposal is the reduction of many machine learning objectives to kernel sums. Our sketch estimates these sums using randomized contingency tables that are indexed with locality-sensitive hashing. Existing alternatives for kernel sum estimation scale poorly, often becoming exponentially slower as the dimension increases. In contrast, our sketch runs quickly on large high-dimensional datasets, such as the 65-million-node Friendster graph, in a single pass that takes less than 20 minutes, which is infeasible with any known alternative. Exhaustive experiments show that the privacy-utility tradeoff of our method is competitive with existing algorithms, but at an order-of-magnitude smaller computational cost. We expect that our sketch will be practically useful for differential privacy in distributed, large-scale machine learning settings.
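To make the construction concrete, below is a minimal, hypothetical sketch of the idea the abstract describes: an array of randomized counters indexed by locality-sensitive hashing, built in one streaming pass, merged by elementwise addition, and made differentially private with Laplace noise on the counters. The signed-random-projection (SimHash) hash family, the class name LSHKernelSketch, and all parameter choices are illustrative assumptions, not the paper's exact construction or its published privacy analysis.

```python
# Illustrative sketch only: LSH-indexed counters that estimate kernel sums
# (where the kernel is the LSH collision probability), with mergeability
# and a simple Laplace-noise privatization step. Assumed construction,
# not the paper's exact algorithm.
import numpy as np


class LSHKernelSketch:
    def __init__(self, dim, num_tables=50, bits_per_table=4, seed=0):
        rng = np.random.default_rng(seed)
        self.num_tables = num_tables
        self.bits = bits_per_table
        self.width = 2 ** bits_per_table
        # One signed-random-projection (SimHash) function per table.
        self.projections = rng.standard_normal((num_tables, bits_per_table, dim))
        # num_tables independent rows of counters ("contingency tables").
        self.counts = np.zeros((num_tables, self.width))

    def _index(self, x):
        # Hash x to one bucket per table via the sign pattern of its projections.
        bits = (np.einsum("tbd,d->tb", self.projections, x) >= 0).astype(int)
        return bits.dot(1 << np.arange(self.bits))

    def add(self, x):
        # One-pass streaming update: increment one counter per table.
        self.counts[np.arange(self.num_tables), self._index(x)] += 1

    def query(self, q):
        # The mean of the hit counters estimates the kernel sum sum_i k(q, x_i),
        # where k(q, x) is the collision probability of the LSH family.
        return self.counts[np.arange(self.num_tables), self._index(q)].mean()

    def merge(self, other):
        # Sketches built on disjoint partitions merge by elementwise addition.
        self.counts += other.counts

    def privatize(self, epsilon):
        # Each point touches one counter per table, so the L1 sensitivity of
        # the counter array is num_tables; Laplace noise at this scale gives
        # epsilon-DP for the released counters (a coarse, illustrative analysis).
        self.counts += np.random.default_rng().laplace(
            scale=self.num_tables / epsilon, size=self.counts.shape
        )


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    data = rng.standard_normal((10_000, 32))
    sketch = LSHKernelSketch(dim=32)
    for x in data:
        sketch.add(x)
    sketch.privatize(epsilon=1.0)
    print("estimated kernel sum at a query:", sketch.query(rng.standard_normal(32)))
```

In this toy version, each data point increments exactly one counter per table, so sketches built on disjoint shards can simply be summed before a single round of noise is added, which is what makes the mergeable, one-pass, distributed setting described in the abstract plausible.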