保护隐私的大数据发布

Proceedings of the 27th International Conference on Scientific and Statistical Database Management Pub Date : 2015-06-29 DOI:10.1145/2791347.2791380

Hessam Zakerzadeh, C. Aggarwal, K. Barker

{"title":"保护隐私的大数据发布","authors":"Hessam Zakerzadeh, C. Aggarwal, K. Barker","doi":"10.1145/2791347.2791380","DOIUrl":null,"url":null,"abstract":"The problem of privacy-preserving data mining has been studied extensively in recent years because of its importance as a key enabler in the sharing of massive data sets. Most of the work in privacy has focussed on issues involving the quality of privacy preservation and utility, though there has been little focus on the issue of scalability in privacy preservation. The reason for this is that anonymization has generally been seen as a batch and one-time process in the context of data sharing. However, in recent years, the sizes of data sets have grown tremendously to a point where the effective application of the current algorithms is becoming increasingly difficult. Furthermore, the transient nature of recent data sets has resulted in an increased need for the repeated application of such methods on the newer data sets which have been collected. Repeated application demands even greater computational efficiency in order to be practical. For example, an algorithm with quadratic complexity is unlikely to be implementable in reasonable time over terabyte scale data sets. A bigger issue is that larger data sets are likely to be addressed by distributed frameworks such as MapReduce. In such frameworks, one has to address the additional issue of minimizing data transfer across different nodes, which is the bottleneck. In this paper, we discuss the first approach towards privacy-preserving data mining of very massive data sets using MapReduce. We study two most widely-used privacy models k-anonymity and l-diversity for anonymization, and present experimental results illustrating the effectiveness of the approach.","PeriodicalId":225179,"journal":{"name":"Proceedings of the 27th International Conference on Scientific and Statistical Database Management","volume":"65 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2015-06-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"66","resultStr":"{\"title\":\"Privacy-preserving big data publishing\",\"authors\":\"Hessam Zakerzadeh, C. Aggarwal, K. Barker\",\"doi\":\"10.1145/2791347.2791380\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The problem of privacy-preserving data mining has been studied extensively in recent years because of its importance as a key enabler in the sharing of massive data sets. Most of the work in privacy has focussed on issues involving the quality of privacy preservation and utility, though there has been little focus on the issue of scalability in privacy preservation. The reason for this is that anonymization has generally been seen as a batch and one-time process in the context of data sharing. However, in recent years, the sizes of data sets have grown tremendously to a point where the effective application of the current algorithms is becoming increasingly difficult. Furthermore, the transient nature of recent data sets has resulted in an increased need for the repeated application of such methods on the newer data sets which have been collected. Repeated application demands even greater computational efficiency in order to be practical. For example, an algorithm with quadratic complexity is unlikely to be implementable in reasonable time over terabyte scale data sets. A bigger issue is that larger data sets are likely to be addressed by distributed frameworks such as MapReduce. In such frameworks, one has to address the additional issue of minimizing data transfer across different nodes, which is the bottleneck. In this paper, we discuss the first approach towards privacy-preserving data mining of very massive data sets using MapReduce. We study two most widely-used privacy models k-anonymity and l-diversity for anonymization, and present experimental results illustrating the effectiveness of the approach.\",\"PeriodicalId\":225179,\"journal\":{\"name\":\"Proceedings of the 27th International Conference on Scientific and Statistical Database Management\",\"volume\":\"65 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2015-06-29\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"66\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 27th International Conference on Scientific and Statistical Database Management\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/2791347.2791380\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 27th International Conference on Scientific and Statistical Database Management","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2791347.2791380","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 66

摘要

隐私保护数据挖掘作为海量数据集共享的关键因素，近年来得到了广泛的研究。尽管隐私保护的可扩展性问题很少受到关注，但大多数隐私保护工作都集中在涉及隐私保护质量和实用性的问题上。这样做的原因是，在数据共享的背景下，匿名化通常被视为批量和一次性的过程。然而，近年来，数据集的规模急剧增长，使得现有算法的有效应用变得越来越困难。此外，近期数据集的瞬态性质导致越来越需要对已收集的新数据集重复应用这些方法。重复的应用需要更高的计算效率才能实用。例如，在tb规模的数据集上，具有二次复杂度的算法不太可能在合理的时间内实现。一个更大的问题是，更大的数据集可能由分布式框架(如MapReduce)来处理。在这样的框架中，必须解决最小化跨不同节点的数据传输的附加问题，这是瓶颈。在本文中，我们讨论了使用MapReduce对非常大的数据集进行隐私保护数据挖掘的第一种方法。我们研究了两种最广泛使用的匿名化隐私模型k-匿名和l-多样性，并给出了实验结果，说明了该方法的有效性。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Privacy-preserving big data publishing

The problem of privacy-preserving data mining has been studied extensively in recent years because of its importance as a key enabler in the sharing of massive data sets. Most of the work in privacy has focussed on issues involving the quality of privacy preservation and utility, though there has been little focus on the issue of scalability in privacy preservation. The reason for this is that anonymization has generally been seen as a batch and one-time process in the context of data sharing. However, in recent years, the sizes of data sets have grown tremendously to a point where the effective application of the current algorithms is becoming increasingly difficult. Furthermore, the transient nature of recent data sets has resulted in an increased need for the repeated application of such methods on the newer data sets which have been collected. Repeated application demands even greater computational efficiency in order to be practical. For example, an algorithm with quadratic complexity is unlikely to be implementable in reasonable time over terabyte scale data sets. A bigger issue is that larger data sets are likely to be addressed by distributed frameworks such as MapReduce. In such frameworks, one has to address the additional issue of minimizing data transfer across different nodes, which is the bottleneck. In this paper, we discuss the first approach towards privacy-preserving data mining of very massive data sets using MapReduce. We study two most widely-used privacy models k-anonymity and l-diversity for anonymization, and present experimental results illustrating the effectiveness of the approach.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Proceedings of the 27th International Conference on Scientific and Statistical Database Management

自引率

0.00%

发文量