Local synthesis for disclosure limitation that satisfies probabilistic k-anonymity criterion.

IF 0.8 Q3 COMPUTER SCIENCE, THEORY & METHODS

Transactions on Data Privacy Pub Date : 2017-04-01

Anna Oganian, Josep Domingo-Ferrer

Transactions on Data Privacy, 10(1), pp. 61-81. Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6760907/pdf/


Abstract

Before releasing databases which contain sensitive information about individuals, data publishers must apply Statistical Disclosure Limitation (SDL) methods to them, in order to avoid disclosure of sensitive information on any identifiable data subject. SDL methods often consist of masking or synthesizing the original data records in such a way as to minimize the risk of disclosure of the sensitive information while providing data users with accurate information about the population of interest. In this paper we propose a new scheme for disclosure limitation, based on the idea of local synthesis of data. Our approach is predicated on model-based clustering. The proposed method satisfies the requirements of k-anonymity; in particular we use a variant of the k-anonymity privacy model, namely probabilistic k-anonymity, by incorporating constraints on cluster cardinality. Regarding data utility, for continuous attributes, we exactly preserve means and covariances of the original data, while approximately preserving higher-order moments and analyses on subdomains (defined by clusters and cluster combinations). For both continuous and categorical data, our experiments with medical data sets show that, from the point of view of data utility, local synthesis compares very favorably with other methods of disclosure limitation including the sequential regression approach for synthetic data generation.
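The utility claim for continuous attributes can be illustrated with a small sketch. This is not the authors' algorithm: it assumes the cluster labels are already available (the paper obtains them by model-based clustering, which is not reproduced here), and it swaps in a standard whiten-and-recolor construction to draw synthetic records whose sample mean and covariance match each cluster exactly. The function names `synthesize_cluster` and `local_synthesis`, the parameter `k`, and the minimum-size check standing in for the cluster cardinality constraint are all hypothetical. The sketch also assumes each cluster has more records than attributes and a non-singular covariance matrix.

```python
import numpy as np


def synthesize_cluster(X, rng):
    """Draw synthetic records with the same sample mean and covariance as X.

    Assumption: X has more rows than columns and a positive-definite
    sample covariance; otherwise the Cholesky factorization fails.
    """
    n, p = X.shape
    if n <= p:
        raise ValueError("cluster needs more records than attributes")
    mean = X.mean(axis=0)
    cov = np.atleast_2d(np.cov(X, rowvar=False))

    # Random draws, centered so their sample mean is exactly zero.
    Z = rng.standard_normal((n, p))
    Z -= Z.mean(axis=0)

    # Whiten: after this step the sample covariance of W is the identity.
    Lz = np.linalg.cholesky(np.atleast_2d(np.cov(Z, rowvar=False)))
    W = np.linalg.solve(Lz, Z.T).T

    # Recolor with the Cholesky factor of the cluster's covariance.
    Lx = np.linalg.cholesky(cov)
    return mean + W @ Lx.T


def local_synthesis(X, labels, k, seed=0):
    """Replace every cluster of X by moment-matched synthetic records.

    `labels` assigns each row to a cluster; clusters smaller than k are
    rejected (a hypothetical stand-in for the cardinality constraint).
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    out = np.empty_like(X)
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        if idx.size < k:
            raise ValueError(f"cluster {c} has fewer than k={k} records")
        out[idx] = synthesize_cluster(X[idx], rng)
    return out
```

Because each cluster keeps its size, mean, and covariance, the mean and covariance of the whole synthetic data set also equal those of the original data. Higher-order moments are only approximated, since the synthetic records within a cluster are drawn from a normal distribution.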

