Efficient Discovery of De-identification Policies Through a Risk-Utility Frontier.

Weiyi Xia, Raymond Heatherly, Xiaofeng Ding, Jiuyong Li, Bradley Malin
{"title":"Efficient Discovery of De-identification Policies Through a Risk-Utility Frontier.","authors":"Weiyi Xia, Raymond Heatherly, Xiaofeng Ding, Jiuyong Li, Bradley Malin","doi":"10.1145/2435349.2435357","DOIUrl":null,"url":null,"abstract":"<p><p>Modern information technologies enable organizations to capture large quantities of person-specific data while providing routine services. Many organizations hope, or are legally required, to share such data for secondary purposes (e.g., validation of research findings) in a de-identified manner. In previous work, it was shown de-identification policy alternatives could be modeled on a lattice, which could be searched for policies that met a prespecified risk threshold (e.g., likelihood of re-identification). However, the search was limited in several ways. First, its definition of utility was syntactic - based on the level of the lattice - and not semantic - based on the actual changes induced in the resulting data. Second, the threshold may not be known in advance. The goal of this work is to build the optimal set of policies that trade-off between privacy risk (R) and utility (U), which we refer to as a R-U frontier. To model this problem, we introduce a semantic definition of utility, based on information theory, that is compatible with the lattice representation of policies. To solve the problem, we initially build a set of policies that define a frontier. We then use a probability-guided heuristic to search the lattice for policies likely to update the frontier. To demonstrate the effectiveness of our approach, we perform an empirical analysis with the Adult dataset of the UCI Machine Learning Repository. We show that our approach can construct a frontier closer to optimal than competitive approaches by searching a smaller number of policies. In addition, we show that a frequently followed de-identification policy (i.e., the Safe Harbor standard of the HIPAA Privacy Rule) is suboptimal in comparison to the frontier discovered by our approach.</p>","PeriodicalId":90472,"journal":{"name":"CODASPY : proceedings of the ... ACM conference on data and application security and privacy. ACM Conference on Data and Application Security & Privacy","volume":"2013 ","pages":"59-70"},"PeriodicalIF":0.0000,"publicationDate":"2013-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4266184/pdf/nihms617161.pdf","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"CODASPY : proceedings of the ... ACM conference on data and application security and privacy. ACM Conference on Data and Application Security & Privacy","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2435349.2435357","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Modern information technologies enable organizations to capture large quantities of person-specific data while providing routine services. Many organizations hope, or are legally required, to share such data for secondary purposes (e.g., validation of research findings) in a de-identified manner. In previous work, it was shown that de-identification policy alternatives could be modeled on a lattice, which could be searched for policies that met a prespecified risk threshold (e.g., likelihood of re-identification). However, the search was limited in several ways. First, its definition of utility was syntactic, based on the level of the lattice, rather than semantic, based on the actual changes induced in the resulting data. Second, the threshold may not be known in advance. The goal of this work is to build the optimal set of policies that trade off between privacy risk (R) and utility (U), which we refer to as an R-U frontier. To model this problem, we introduce a semantic definition of utility, based on information theory, that is compatible with the lattice representation of policies. To solve the problem, we initially build a set of policies that define a frontier. We then use a probability-guided heuristic to search the lattice for policies likely to update the frontier. To demonstrate the effectiveness of our approach, we perform an empirical analysis with the Adult dataset of the UCI Machine Learning Repository. We show that our approach can construct a frontier closer to optimal than competitive approaches by searching a smaller number of policies. In addition, we show that a frequently followed de-identification policy (i.e., the Safe Harbor standard of the HIPAA Privacy Rule) is suboptimal in comparison to the frontier discovered by our approach.
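The sketch below is a minimal, self-contained illustration of the R-U frontier idea described in the abstract, not the paper's algorithm. It scores a handful of candidate generalization policies with a simplified re-identification risk (the average of 1/|equivalence class| over quasi-identifier tuples) and a simplified entropy-based information loss, then keeps only the Pareto-optimal (non-dominated) policies. The policy encodings, the risk measure, and the utility measure are illustrative assumptions, not the definitions used in the paper.

```python
# Illustrative sketch (not the paper's algorithm): score candidate
# generalization policies with a simple re-identification risk and an
# entropy-based utility loss, then keep only the R-U frontier
# (the Pareto-optimal, non-dominated set).
from collections import Counter
from math import log2


def risk(records):
    """Average re-identification probability: mean of 1/|equivalence class|
    over records, where a class groups identical quasi-identifier tuples."""
    groups = Counter(records)
    return sum(1.0 / groups[r] for r in records) / len(records)


def utility_loss(original, generalized):
    """Entropy-based information loss: reduction in Shannon entropy of the
    quasi-identifier distribution after generalization (0 = no loss)."""
    def entropy(recs):
        counts, n = Counter(recs), len(recs)
        return -sum((c / n) * log2(c / n) for c in counts.values())
    return entropy(original) - entropy(generalized)


def ru_frontier(policies):
    """Keep policies not dominated by another with lower risk AND lower loss."""
    frontier = []
    for name, r, u in policies:
        dominated = any(r2 <= r and u2 <= u and (r2 < r or u2 < u)
                        for _, r2, u2 in policies)
        if not dominated:
            frontier.append((name, r, u))
    return sorted(frontier, key=lambda p: p[1])


if __name__ == "__main__":
    # Toy quasi-identifier tuples (age, ZIP) under two hypothetical policies.
    original = [("34", "37215"), ("35", "37215"), ("34", "37217"), ("51", "37215")]
    gen_age  = [("30-39", z) for _, z in original[:3]] + [("50-59", original[3][1])]
    gen_both = [(a, z[:3] + "**") for a, z in gen_age]

    candidates = [
        ("no-generalization", risk(original), utility_loss(original, original)),
        ("age-banded",        risk(gen_age),  utility_loss(original, gen_age)),
        ("age+zip-banded",    risk(gen_both), utility_loss(original, gen_both)),
    ]
    for name, r, u in ru_frontier(candidates):
        print(f"{name}: risk={r:.2f}, utility loss={u:.2f}")
```

In the paper, the candidate policies come from a generalization lattice and the frontier is updated by a probability-guided heuristic search rather than exhaustive scoring; the sketch only shows how a frontier filters out dominated policies once risk and utility scores are available.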
