A Partial Optimization Approach for Privacy Preserving Frequent Itemset Mining
Shibnath Mukherjee, A. Gangopadhyay, Zhiyuan Chen
Cited by: 8
Abstract
While data mining has been widely acclaimed as a technology that can bring potential benefits to organizations, such efforts may be negatively impacted by the possibility of discovering sensitive patterns, particularly in patient data. In this article the authors present an approach to identify the optimal set of transactions that, if sanitized, would hide sensitive patterns while reducing as much as possible both the accidental hiding of legitimate patterns and the damage done to the database. The methodology allows the user to adjust the weights assigned to the benefit, measured by the number of restrictive patterns hidden; the cost, measured by the number of legitimate patterns hidden; and the damage to the database, measured by the difference between the marginal frequencies of items in the original and sanitized databases. Most approaches to this problem found in the literature are purely heuristic, with no formal treatment of optimality. While a few works have used integer linear programming (ILP) as a formal optimization approach, the novelty of this method is its extremely low-complexity cost model in contrast to the others. The authors implemented the methodology in C and C++ and ran several experiments on synthetic data generated with the IBM synthetic data generator. The experiments show excellent results when compared to those in the literature.

DOI: 10.4018/jcmam.2010072002. International Journal of Computational Models and Algorithms in Medicine, 1(1), 19-33, January-March 2010, edited by Aryya Gangopadhyay. © 2010, IGI Global.
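The weighted tradeoff described in the abstract can be sketched in assumed notation (the symbols below, including the weights $w_b$, $w_c$, $w_d$, are illustrative and not taken from the paper):

\[
\max_{S \subseteq D} \; w_b\,\lvert R_h(S)\rvert \;-\; w_c\,\lvert L_h(S)\rvert \;-\; w_d \sum_{i \in I} \bigl\lvert f_D(i) - f_{D'}(i) \bigr\rvert
\]

where $S$ is the set of transactions selected for sanitization, $R_h(S)$ and $L_h(S)$ are the restrictive and legitimate patterns hidden by sanitizing $S$, and $f_D(i)$, $f_{D'}(i)$ are the marginal frequencies of item $i$ in the original database $D$ and the sanitized database $D'$.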
2002; Oliviera et al., 2003a, 2003b; Han et al., 2006). A number of cases have been reported in the literature where data mining has posed threats of disclosing sensitive knowledge and violating privacy. One typical problem is inferencing, i.e., inferring sensitive information from non-sensitive or unclassified data (Oliviera et al., 2002; Clifton, 2001).

Data mining is part of the larger business intelligence initiatives taking place in organizations across government and industry sectors, many of which include medical applications. It is used for prediction as well as knowledge discovery, which can lead to cost reduction, business expansion, and detection of fraud or wastage of resources, among other things. Along with its many benefits, data mining has given rise to increasingly complex and controversial privacy issues. For example, the privacy implications of data mining have led to high-profile controversies involving the use of data mining tools and techniques on data related to drug prescriptions. Two major health care data publishers filed a petition to the Supreme Court on whether commercial use of data mining is protected by the First Amendment1, an appeal of a controversial ruling by the 1st U.S. Circuit Court of Appeals that upheld a 2006 New Hampshire law banning the use of doctors' prescription histories to increase drug sales.

Privacy implications are also a major roadblock to information sharing across organizations. For example, sharing inventory data might reveal information that competitors can use to gain strategic advantages. Unless the actual or perceived implications of data mining methods on privacy are properly dealt with, they can lead to sub-optimal decision making in organizations and reluctance among the general public to accept such tools.
For example, there could be benefits in sharing prescription data from different pharmacy stores to mine for information such as the use of generic drugs or socio-demographic and geographic patterns in prescription drugs; however, this requires moving the data from each store or site to a central location, which increases the risk of litigation.

Several potential problems identified for privacy protection make the case for privacy-preserving data mining. These include legal requirements for protecting data, e.g., the HIPAA healthcare regulations in the US (Federal Register, 2002); liability from inadvertent disclosure of data; the risk of misuse of proprietary information (Atallah et al., 2003); and antitrust concerns (Vaidya et al., 2006). It is thus of growing importance to devise efficient tradeoffs between knowledge discovery and knowledge hiding in databases, so that the cost to the parties involved is minimized while the benefit is maximized. The work presented in this article focuses on formulating a model for sanitizing databases against the discovery of restrictive associative patterns while distorting the databases, and legitimate pattern discovery, as little as possible.

To illustrate the problem, consider a classic example given in (Evfimievski et al., 2002; Oliviera et al., 2002). There is a server and several clients, each having its own set of items. The clients want the server to provide them with recommendations based on statistical information about associations among items. However, the clients do not want the server to learn certain restrictive patterns. If what is sent to the server is the raw database, then in its search for frequent patterns the server will discover the restrictive patterns as well. Thus the client has to send the raw database modified in such a manner that the restrictive patterns are not discovered.
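The support-reduction sanitization described in this example can be sketched as follows. This is a minimal illustrative heuristic in Python, not the authors' ILP formulation; the function names and the greedy choice of which item to delete are assumptions for illustration.

```python
# Minimal sketch of support-reduction sanitization (illustrative only,
# not the paper's ILP model). A database is a list of transactions
# (sets of items); a restrictive pattern is hidden by deleting one of
# its items from supporting transactions until the pattern's support
# falls below the mining threshold.

def support(db, pattern):
    """Fraction of transactions that contain every item of `pattern`."""
    return sum(1 for t in db if pattern <= t) / len(db)

def sanitize(db, pattern, min_support):
    """Greedily delete one item of `pattern` from supporting transactions
    until support(pattern) drops below `min_support`."""
    db = [set(t) for t in db]          # work on a copy of the database
    victim = next(iter(pattern))       # arbitrary item of the pattern to delete
    for t in db:
        if support(db, pattern) < min_support:
            break                      # pattern already hidden
        if pattern <= t:
            t.discard(victim)          # transaction no longer supports pattern
    return db

db = [{"a", "b", "c"}, {"a", "b"}, {"b", "c"}, {"a", "b", "c"}]
hidden = sanitize(db, {"a", "b"}, min_support=0.5)
print(support(db, {"a", "b"}))       # 0.75 before sanitization
print(support(hidden, {"a", "b"}))   # 0.25 after: pattern is hidden
```

Note that a real sanitizer would choose the victim item and the victim transactions to minimize collateral damage to legitimate patterns, which is precisely the optimization the article formulates.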
This requires distorting the raw database before sending it to the server, and the distortion should be minimal, as should the hiding of legitimate patterns. Other examples of the problem are given in (Verykios et al., 2004). The example shows the vulnerability of critical frequent patterns; it is directly associated with the problem of exposing critical association rules as well, since rules are built from patterns. Indeed, some of the research, such as (Verykios et al., 2004), uses reduction of the support of sensitive frequent patterns as one of the methods to hide the association rules that could be generated from them. All these methods are based on modifying the …
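The link between pattern hiding and rule hiding can be illustrated with a short sketch (assumed notation, not taken from the cited work): a rule X ⇒ Y is reported only if its generating pattern X ∪ Y is frequent, so pushing that pattern's support below the threshold hides every rule built from it.

```python
# Illustrative check: an association rule X => Y is reported only if
# support(X ∪ Y) >= min_support and
# confidence = support(X ∪ Y) / support(X) >= min_confidence.
# Reducing the support of the pattern X ∪ Y below min_support therefore
# hides every rule generated from that pattern.

def support(db, itemset):
    return sum(1 for t in db if itemset <= t) / len(db)

def rule_holds(db, x, y, min_support, min_confidence):
    s_xy = support(db, x | y)
    if s_xy < min_support:
        return False                  # generating pattern not frequent: rule hidden
    return s_xy / support(db, x) >= min_confidence

db = [{"a", "b"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]
print(rule_holds(db, {"a"}, {"b"}, 0.5, 0.6))   # True: supp = 0.5, conf = 2/3

# Sanitize one supporting transaction: supp({a, b}) drops to 0.25 < 0.5.
db[0].discard("b")
print(rule_holds(db, {"a"}, {"b"}, 0.5, 0.6))   # False: rule is now hidden
```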