Evaluating the disclosure risk of anonymized documents via a machine learning-based re-identification attack

IF 4.3 · Zone 3 (Computer Science) · Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Data Mining and Knowledge Discovery Pub Date : 2024-09-03 DOI:10.1007/s10618-024-01066-3

Benet Manzanares-Salor, David Sánchez, Pierre Lison


Citation count: 0

Abstract

The availability of textual data depicting human-centered features and behaviors is crucial for many data mining and machine learning tasks. However, data containing personal information should be anonymized prior to making them available for secondary use. A variety of text anonymization methods have been proposed in recent years, and they are typically evaluated by comparing their outputs with human-made anonymizations. The residual disclosure risk is estimated with the recall metric, which quantifies the proportion of manually annotated re-identifying terms successfully detected by the anonymization algorithm. Nevertheless, recall is not a risk metric, and using it as one has several drawbacks. First, it requires a unique ground truth, which does not exist for text anonymization, where several masking choices could be equally valid to prevent re-identification. Second, it relies on human judgements, which are inherently subjective and prone to errors. Finally, the recall metric weights terms uniformly, thereby ignoring the fact that some missed terms may influence the disclosure risk much more than others. To overcome these drawbacks, in this paper we propose a novel method to evaluate the disclosure risk of anonymized texts by means of an automated re-identification attack. We formalize the attack as a multi-class classification task and leverage state-of-the-art neural language models to aggregate the data sources that attackers may use to build the classifier. We illustrate the effectiveness of our method by assessing the disclosure risk of several methods for text anonymization under different attack configurations. Empirical results show substantial privacy risks for most existing anonymization methods.
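The contrast between recall and an attack-based risk estimate can be illustrated with a toy sketch. This is not the authors' method: the paper uses neural language models, whereas the classifier below is a simple token-overlap matcher standing in for them. All names (`recall`, `reidentify`, `reidentification_risk`), the stopword list, and the example data are hypothetical and chosen only for illustration.

```python
from collections import Counter

# Assumption: a tiny hand-picked stopword list, just for this toy example.
STOPWORDS = {"a", "as", "at", "in", "is", "the", "who"}


def recall(annotated_terms, masked_terms):
    """Share of manually annotated re-identifying terms that were masked."""
    annotated = set(annotated_terms)
    if not annotated:
        return 1.0
    return len(annotated & set(masked_terms)) / len(annotated)


def tokenize(text):
    """Lowercase alphabetic tokens, dropping stopwords and mask symbols."""
    words = (w.strip(".,;:").lower() for w in text.split())
    return [w for w in words if w.isalpha() and w not in STOPWORDS]


def reidentify(anonymized_doc, background):
    """Multi-class classification: one class per individual.

    Stand-in for a neural classifier: picks the individual whose
    background text shares the most tokens with the anonymized document,
    or None when nothing overlaps.
    """
    doc = Counter(tokenize(anonymized_doc))
    scores = {
        name: sum((doc & Counter(tokenize(text))).values())
        for name, text in background.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def reidentification_risk(anonymized_docs, background):
    """Fraction of anonymized documents attributed to the right individual."""
    hits = sum(
        reidentify(doc, background) == name
        for name, doc in anonymized_docs.items()
    )
    return hits / len(anonymized_docs)


# Hypothetical background knowledge available to the attacker.
background = {
    "alice": "Alice Smith is a cardiologist at Oslo University Hospital",
    "bob": "Bob Jones is a violinist who plays in Tarragona",
}
# Hypothetical anonymizer outputs; "***" marks a masked term.
anonymized = {
    "alice": "*** works as a cardiologist at ***",
    "bob": "*** is a *** living in ***",
}

alice_recall = recall(
    {"Alice Smith", "cardiologist", "Oslo University Hospital"},
    {"Alice Smith", "Oslo University Hospital"},
)
print(round(alice_recall, 2))                             # 0.67
print(reidentification_risk(anonymized, background))      # 0.5
```

In this toy setting the first document reaches a recall of about 0.67, yet the single missed term is enough for the attack to single out the right individual, which is the kind of non-uniform influence that recall cannot capture.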




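Source journal

Data Mining and Knowledge Discovery (Engineering and Technology, Computer Science: Artificial Intelligence)

CiteScore

10.40

Self-citation rate

4.20%

Articles published

Review time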

10 months

Journal description: Advances in data gathering, storage, and distribution have created a need for computational tools and techniques to aid in data analysis. Data Mining and Knowledge Discovery in Databases (KDD) is a rapidly growing area of research and application that builds on techniques and theories from many fields, including statistics, databases, pattern recognition and learning, data visualization, uncertainty modelling, data warehousing and OLAP, optimization, and high performance computing.