Multivariate Top-Coding for Statistical Disclosure Limitation.

Anna Oganian, Ionut Iacob, Goran Lesaja
{"title":"Multivariate Top-Coding for Statistical Disclosure Limitation.","authors":"Anna Oganian, Ionut Iacob, Goran Lesaja","doi":"10.1007/978-3-030-57521-2_10","DOIUrl":null,"url":null,"abstract":"<p><p>One of the most challenging problems for national statistical agencies is how to release to the public microdata sets with a large number of attributes while keeping the disclosure risk of sensitive information of data subjects under control. When statistical agencies alter microdata in order to limit the disclosure risk, they need to take into account relationships between the variables to produce a good quality public data set. Hence, Statistical Disclosure Limitation (SDL) methods should not be univariate (treating each variable independently of others), but preferably multivariate, that is, handling several variables at the same time. Statistical agencies are often concerned about disclosure risk associated with the extreme values of numerical variables. Thus, such observations are often top or bottom-coded in the public use files. Top-coding consists of the substitution of extreme observations of the numerical variable by a threshold, for example, by the 99th percentile of the corresponding variable. Bottom coding is defined similarly but applies to the values in the lower tail of the distribution. We argue that a univariate form of top/bottom-coding may not offer adequate protection for some subpopulations which are different in terms of a top-coded variable from other subpopulations or the whole population. In this paper, we propose a multivariate form of top-coding based on clustering the variables into groups according to some metric of closeness between the variables and then forming the rules for the multivariate top-codes using techniques of Association Rule Mining within the clusters of variables obtained on the previous step. Bottom-coding procedures can be defined in a similar way. We illustrate our method on a genuine multivariate data set of realistic size.</p>","PeriodicalId":91946,"journal":{"name":"Privacy in statistical databases. PSD (Conference : 2004- )","volume":"12276 ","pages":"136-148"},"PeriodicalIF":0.0000,"publicationDate":"2020-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8057308/pdf/nihms-1676966.pdf","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Privacy in statistical databases. PSD (Conference : 2004- )","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1007/978-3-030-57521-2_10","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2020/9/16 0:00:00","PubModel":"Epub","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

One of the most challenging problems for national statistical agencies is how to release to the public microdata sets with a large number of attributes while keeping the disclosure risk of sensitive information of data subjects under control. When statistical agencies alter microdata in order to limit the disclosure risk, they need to take into account relationships between the variables to produce a good quality public data set. Hence, Statistical Disclosure Limitation (SDL) methods should not be univariate (treating each variable independently of others), but preferably multivariate, that is, handling several variables at the same time. Statistical agencies are often concerned about disclosure risk associated with the extreme values of numerical variables. Thus, such observations are often top or bottom-coded in the public use files. Top-coding consists of the substitution of extreme observations of the numerical variable by a threshold, for example, by the 99th percentile of the corresponding variable. Bottom coding is defined similarly but applies to the values in the lower tail of the distribution. We argue that a univariate form of top/bottom-coding may not offer adequate protection for some subpopulations which are different in terms of a top-coded variable from other subpopulations or the whole population. In this paper, we propose a multivariate form of top-coding based on clustering the variables into groups according to some metric of closeness between the variables and then forming the rules for the multivariate top-codes using techniques of Association Rule Mining within the clusters of variables obtained on the previous step. Bottom-coding procedures can be defined in a similar way. We illustrate our method on a genuine multivariate data set of realistic size.

统计披露限制的多变量顶部编码。
国家统计机构面临的最具挑战性的问题之一是,如何向公众发布具有大量属性的微观数据集,同时控制数据主体敏感信息的披露风险。当统计机构为了限制披露风险而更改微观数据时,他们需要考虑变量之间的关系,以生成高质量的公共数据集。因此,统计披露限制(SDL)方法不应是单变量的(独立于其他变量处理每个变量),而最好是多变量的,即同时处理多个变量。统计机构经常关注与数字变量极值相关的披露风险。因此,此类观察结果通常在公共使用文件中进行顶部或底部编码。顶部编码包括用阈值替换数值变量的极端观测值,例如用相应变量的第99个百分位数。底部编码的定义类似,但适用于分布的下尾部中的值。我们认为,单变量形式的顶部/底部编码可能无法为某些亚群体提供足够的保护,这些亚群体在顶部编码变量方面与其他亚群体或整个群体不同。在本文中,我们提出了一种多元顶端编码形式,该形式基于根据变量之间的一些接近度度量将变量聚类成组,然后在前一步获得的变量聚类中使用关联规则挖掘技术形成多元顶端编码的规则。底部编码过程可以用类似的方式定义。我们在真实大小的真实多元数据集上说明了我们的方法。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信