DataSynthesizer: Privacy-Preserving Synthetic Datasets

Proceedings of the 29th International Conference on Scientific and Statistical Database Management. Pub Date: 2017-06-27. DOI: 10.1145/3085504.3091117

Haoyue Ping, Julia Stoyanovich, Bill Howe


Citations: 96

Abstract

To facilitate collaboration over sensitive data, we present DataSynthesizer, a tool that takes a sensitive dataset as input and generates a structurally and statistically similar synthetic dataset with strong privacy guarantees. The data owners need not release their data, while potential collaborators can begin developing models and methods with some confidence that their results will work similarly on the real dataset. The distinguishing feature of DataSynthesizer is its usability: the data owner does not have to specify any parameters to start generating and sharing data safely and effectively. DataSynthesizer consists of three high-level modules: DataDescriber, DataGenerator and ModelInspector. The first, DataDescriber, investigates the data types, correlations and distributions of the attributes in the private dataset, and produces a data summary, adding noise to the distributions to preserve privacy. DataGenerator samples from the summary computed by DataDescriber and outputs synthetic data. ModelInspector shows an intuitive description of the data summary that was computed by DataDescriber, allowing the data owner to evaluate the accuracy of the summarization process and adjust any parameters, if desired. We describe DataSynthesizer and illustrate its use in an urban science context, where sharing sensitive, legally encumbered data between agencies and with outside collaborators is reported as the primary obstacle to data-driven governance. The code implementing all parts of this work is publicly available at https://github.com/DataResponsibly/DataSynthesizer.
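The describe-then-sample pipeline outlined in the abstract can be illustrated with a small sketch. This is not the DataSynthesizer code or its API; it is a hypothetical, single-attribute toy that assumes a categorical attribute with a publicly known domain, uses the standard Laplace mechanism on counts (assuming sensitivity 1 under add-or-remove-one-record neighbouring datasets), and ignores correlations between attributes. The function names `describe` and `generate` and the parameter `epsilon` are chosen here for illustration only.

```python
import random
from collections import Counter


def describe(values, domain, epsilon, rng):
    """Return a noisy, normalized histogram (a toy 'data summary').

    Assumption: `domain` is public, and each record changes one count by 1,
    so Laplace noise with scale 1/epsilon is added to every count.
    """
    counts = Counter(values)
    noisy = {}
    for category in domain:
        # The difference of two Exp(rate=epsilon) draws is Laplace(0, 1/epsilon).
        noise = rng.expovariate(epsilon) - rng.expovariate(epsilon)
        # Clip negatives so the result can be used as a probability weight.
        noisy[category] = max(counts.get(category, 0) + noise, 0.0)
    total = sum(noisy.values())
    if total == 0:
        # Degenerate case: fall back to a uniform distribution.
        return {category: 1.0 / len(domain) for category in domain}
    return {category: value / total for category, value in noisy.items()}


def generate(summary, n, rng):
    """Sample n synthetic values from the summary only, never from raw data."""
    categories = list(summary)
    weights = [summary[c] for c in categories]
    return rng.choices(categories, weights=weights, k=n)


# Toy usage with made-up data.
rng = random.Random(0)
domain = ["A", "B", "C"]
private = ["A"] * 50 + ["B"] * 30 + ["C"] * 20
summary = describe(private, domain, epsilon=1.0, rng=rng)
synthetic = generate(summary, 100, rng)
```

The point of the split is that `generate` only ever sees the noisy summary, so whatever privacy guarantee the summary carries extends to any number of synthetic rows drawn from it.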
