Disclosure risk assessment with Bayesian non-parametric hierarchical modelling

arXiv - STAT - Computation. Pub Date: 2024-08-22. DOI: arxiv-2408.12521

Marco Battiston, Lorenzo Rimella



Abstract

Micro and survey datasets often contain private information about individuals, such as their health status, income or political preferences. Previous studies have shown that, even after data anonymization, a malicious intruder may still be able to identify individuals in the dataset by matching their variables to external information. Disclosure risk measures are statistical measures that quantify how large this risk is for a specific dataset. One of the most common measures is the number of sample-unique values that are also population-unique. \cite{Man12} have shown how mixed membership models can provide very accurate estimates of this measure. A limitation of that approach is that the number of extreme profiles has to be chosen by the modeller. In this article, we propose a non-parametric version of the model, based on the Hierarchical Dirichlet Process (HDP). The proposed approach does not require any tuning parameter or model selection step and provides accurate estimates of the disclosure risk measure, even with samples as small as 1% of the population size. Moreover, a data augmentation scheme to address the presence of structural zeros is presented. The proposed methodology is tested on a real dataset from the New York census.
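To make the risk measure in the abstract concrete, the sketch below counts the sample-unique combinations of key variables that are also unique in the population. It is only an illustration of the quantity being estimated, not the authors' HDP-based estimator: it assumes the full population is known, which is precisely what is unavailable in practice. The toy records, the variable names and the function `count_sample_and_population_uniques` are hypothetical.

```python
from collections import Counter


def count_sample_and_population_uniques(population, sample_indices):
    """Count key-variable combinations that appear exactly once in the
    sample and exactly once in the population.

    population      : list of tuples, one tuple of categorical key
                      variables per individual (hypothetical toy format)
    sample_indices  : indices of the individuals included in the sample
    """
    # Frequency of each combination in the whole population.
    population_counts = Counter(population)
    # Frequency of each combination among the sampled individuals only.
    sample_counts = Counter(population[i] for i in sample_indices)
    # A combination contributes to the measure when both counts equal 1.
    return sum(
        1
        for combination, sample_freq in sample_counts.items()
        if sample_freq == 1 and population_counts[combination] == 1
    )


# Toy population (assumed data): (sex, age band, state) for six people.
toy_population = [
    ("F", "30-39", "NY"),
    ("F", "30-39", "NY"),
    ("M", "40-49", "NY"),
    ("M", "20-29", "NJ"),
    ("F", "50-59", "NJ"),
    ("F", "50-59", "NJ"),
]
# Toy sample: individuals 0, 2 and 4 are released.
toy_sample_indices = [0, 2, 4]

risk = count_sample_and_population_uniques(toy_population, toy_sample_indices)
print(risk)
```

In this toy case all three sampled records are unique in the sample, but only the record `("M", "40-49", "NY")` is also unique in the population. A model-based approach such as the one described above is needed because the population counts are not observed and must be inferred from the sample.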


