Building a Dossier on the Cheap: Integrating Distributed Personal Data Resources Under Cost Constraints

Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, Pub Date: 2017-11-06, DOI: 10.1145/3132847.3132951

Imrul Chowdhury Anindya, Harichandan Roy, Murat Kantarcioglu, B. Malin


Citations: 4

Abstract

A wide variety of personal data is routinely collected by numerous organizations that, in turn, share and sell their collections for analytic investigations (e.g., market research). To preserve privacy, certain identifiers are often redacted, perturbed, or even removed. A substantial number of attacks have shown that, if care is not taken, such data can be linked to external resources to determine the explicit identifiers (e.g., personal names) or infer sensitive attributes (e.g., income) of the individuals from whom the data was collected. As such, organizations increasingly rely upon record linkage methods to assess the risk such attacks pose and adopt countermeasures accordingly. Traditional linkage methods assume that only two datasets would be linked (e.g., linking de-identified hospital discharge records to identified voter registration lists), but with the advent of a multi-billion-dollar data broker industry, modern adversaries have access to a massive stash of datasets that can be leveraged. Still, realistic adversaries have budget constraints that prevent them from obtaining and integrating all relevant datasets. Thus, in this work, we investigate a novel privacy risk assessment framework, based on adversaries who plan an integration of datasets to obtain the most accurate estimate of targeted sensitive attributes under a certain budget. To solve this problem, we introduce a graph-based formulation and predictive modeling methods to prioritize data resources for linkage. We perform an empirical analysis using real-world voter registration data from two different U.S. states and show that the methods can be used efficiently to accurately estimate potentially sensitive information disclosure risks even under a non-trivial amount of noise.
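The abstract does not spell out the graph formulation, so the following is only a toy sketch of what a budget-constrained linkage plan could look like, not the authors' method. It assumes datasets are nodes, an edge means two datasets share enough quasi-identifiers to be linked, each edge carries an assumed probability that a link is correct, and each dataset has a purchase cost. Every dataset name, cost, and accuracy below is hypothetical, and the exhaustive path search stands in for whatever prioritization the paper actually uses.

```python
# Hypothetical purchase costs; the adversary already holds "deid" (cost 0).
COSTS = {"voter_list": 30, "broker_a": 50, "broker_b": 20, "social": 10}

# Undirected edges between linkable datasets. Each weight is an assumed
# probability that a record linked across the two datasets is correct.
EDGES = {
    ("deid", "voter_list"): 0.9,
    ("deid", "broker_b"): 0.6,
    ("voter_list", "broker_a"): 0.8,
    ("broker_b", "broker_a"): 0.7,
    ("broker_b", "social"): 0.5,
}


def neighbors(node):
    """Yield (neighbor, link_accuracy) pairs for an undirected edge list."""
    for (a, b), acc in EDGES.items():
        if a == node:
            yield b, acc
        elif b == node:
            yield a, acc


def best_plan(start, target, budget):
    """Return (accuracy, path) for the most accurate affordable path.

    Accuracy is modeled as the product of edge accuracies along the path,
    and the cost as the sum of the costs of the datasets purchased.
    Returns (0.0, None) if no path to the target fits the budget.
    """
    best = (0.0, None)
    stack = [(start, [start], 1.0, 0)]  # node, path so far, accuracy, spent
    while stack:
        node, path, acc, spent = stack.pop()
        if node == target:
            if acc > best[0]:
                best = (acc, path)
            continue
        for nxt, edge_acc in neighbors(node):
            if nxt in path:  # simple paths only
                continue
            cost = spent + COSTS.get(nxt, 0)
            if cost <= budget:
                stack.append((nxt, path + [nxt], acc * edge_acc, cost))
    return best


if __name__ == "__main__":
    for budget in (40, 70, 80):
        print(budget, best_plan("deid", "broker_a", budget))
```

In this toy setup, a larger budget lets the adversary route through the more expensive but more reliable dataset, which is the kind of trade-off the abstract describes. Exhaustive search is only feasible for a handful of datasets; it is used here purely to keep the illustration short.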

