Fast text anonymization using k-anonymity

Proceedings of the 18th International Conference on Information Integration and Web-based Applications and Services. Pub Date: 2016-11-28. DOI: 10.1145/3011141.3011217

Wakana Maeda, Yumiko Suzuki, Satoshi Nakamura


Citations: 4

Abstract

In this paper, we propose a method for anonymizing unstructured texts using a quasi-identifier list. In our method, the system replaces some parts of the quasi-identifiers in the texts with substitute characters such as "*", in order to prevent re-identification of information that should be kept secret. However, this kind of redaction has room for improvement in preserving the information in the original text. If the system anonymizes the texts while keeping as much of the original text as possible, the outputs of data mining techniques applied to the anonymized texts should remain accurate enough to be useful. Our method anonymizes quasi-identifiers while leaving intact the substrings that do not contribute to re-identification, in order to preserve the information in the original texts. Concretely, the system identifies the substrings to redact so that the following two conditions hold: 1) every term in the quasi-identifier list satisfies k-anonymity after its characters are redacted; 2) the number of redacted characters is minimized. From the quasi-identifier list, we construct in advance an anonymization dictionary that records two numbers: the number of quasi-identifiers that are anonymized in the same way, and the number of redacted characters in the anonymized quasi-identifier. However, this construction step is time-consuming, because the system needs to retrieve a huge number of patterns. To solve this problem, we propose an acceleration method for constructing the anonymization dictionary using several heuristics and set theory.
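
The two conditions and the anonymization dictionary can be illustrated with a brute-force sketch in Python. This is not the authors' algorithm or their acceleration method. It is a minimal illustration that assumes terms are anonymized "in the same way" when they yield the same masked string, that terms do not themselves contain "*", and that a term with no k-anonymous pattern is fully masked. The function names (`masked_variants`, `build_dictionary`, `best_pattern`, `anonymize_text`) are hypothetical.

```python
from itertools import combinations


def masked_variants(term, mask="*"):
    """Yield (pattern, n_redacted) for every subset of redacted positions."""
    n = len(term)
    for r in range(n + 1):
        for positions in combinations(range(n), r):
            chars = list(term)
            for i in positions:
                chars[i] = mask
            yield "".join(chars), r


def build_dictionary(quasi_identifiers):
    """Map each pattern to [number of terms sharing it, redacted characters].

    Assumption: terms contain no "*", so one term never produces the same
    pattern twice and the first number counts distinct terms.
    """
    dictionary = {}
    for term in set(quasi_identifiers):
        for pattern, r in masked_variants(term):
            entry = dictionary.setdefault(pattern, [0, r])
            entry[0] += 1
    return dictionary


def best_pattern(term, dictionary, k):
    """Return the pattern shared by >= k terms with the fewest redactions."""
    candidates = [
        (r, pattern)
        for pattern, r in masked_variants(term)
        if dictionary.get(pattern, [0])[0] >= k
    ]
    # Ties on the redaction count are broken alphabetically by pattern.
    return min(candidates)[1] if candidates else None


def anonymize_text(text, quasi_identifiers, k):
    """Replace each quasi-identifier in the text with its best pattern."""
    dictionary = build_dictionary(quasi_identifiers)
    # Longer terms first, so a term contained in another is handled later.
    for term in sorted(set(quasi_identifiers), key=len, reverse=True):
        # Assumed fallback: mask the whole term if no pattern reaches k.
        pattern = best_pattern(term, dictionary, k) or "*" * len(term)
        text = text.replace(term, pattern)
    return text


names = ["Tanaka", "Tamaka", "Suzuki", "Suzuka"]
print(anonymize_text("Tanaka met Suzuki.", names, k=2))
```

Each term of length n yields 2^n patterns in this sketch, which gives a sense of why constructing the dictionary naively is time-consuming and why an accelerated construction is worth pursuing.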


