Comparison of anonymization techniques regarding statistical reproducibility.

PLOS digital health Pub Date : 2025-02-03 eCollection Date: 2025-02-01 DOI:10.1371/journal.pdig.0000735

David Pau, Camille Bachot, Charles Monteil, Laetitia Vinet, Mathieu Boucher, Nadir Sella, Romain Jegou

PLOS digital health, 4(2): e0000735. Publication type: Journal Article. Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11790161/pdf/


Abstract

Background: Anonymization opens up innovative ways of using secondary data outside the requirements of the GDPR, because anonymized data no longer affects the privacy of data subjects. Anonymization requires altering the data, and this project aims to compare how well such privacy protection methods maintain the reliability and utility of scientific data for secondary research purposes.

Methods: The French data protection authority (CNIL) defines anonymization as a processing activity that uses methods to make any identification of people, by any means, irreversibly impossible. To address the project's objective, a series of analyses was performed on a cohort and reproduced on four sets of anonymized data for comparison. Four assessment levels were used to evaluate the impact of anonymization: level 1 referred to the replication of statistical outputs, level 2 to the accuracy of statistical results, level 3 assessed data alteration (using Hellinger distances), and level 4 assessed privacy risks (using WP29 criteria).
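As an illustration of the level 3 measure, the following is a minimal sketch of the Hellinger distance between the empirical distributions of one categorical variable before and after anonymization. It is not the authors' implementation: the function names (`empirical`, `hellinger`) and the toy values are assumptions made for this example. The distance is 0 when the two distributions are identical and reaches 1 when they share no category.

```python
import math
from collections import Counter


def empirical(values):
    """Return the empirical distribution {category: probability} of a sample."""
    counts = Counter(values)
    n = len(values)
    return {category: count / n for category, count in counts.items()}


def hellinger(p, q):
    """Hellinger distance between two discrete distributions given as dicts."""
    categories = set(p) | set(q)
    squared = sum(
        (math.sqrt(p.get(c, 0.0)) - math.sqrt(q.get(c, 0.0))) ** 2
        for c in categories
    )
    return math.sqrt(squared / 2.0)


# Hypothetical toy data, not taken from the study.
raw = ["A", "A", "B", "C"]
reordered = ["A", "B", "A", "C"]   # same distribution, different row order
disjoint = ["X", "Y", "Y", "Z"]    # no category in common with raw

print(hellinger(empirical(raw), empirical(reordered)))  # identical distributions
print(hellinger(empirical(raw), empirical(disjoint)))   # disjoint supports, close to 1
```

For continuous variables, one would first need to bin the values or estimate densities; that choice is left out of this sketch.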

Results: 87 items were produced on the raw cohort data and then reproduced on each of the four anonymized datasets. The overall level 1 replication score ranged from 67% to 100% depending on the anonymization solution. The most difficult analyses to replicate were regression models (sub-score ranging from 78% to 100%) and survival analysis (sub-score ranging from 0% to 100%). The overall level 2 accuracy score ranged from 22% to 79% depending on the anonymization solution. For level 3, three methods had some variables with different probability distributions (Hellinger distance = 1). For level 4, all methods reduced the privacy risk of singling out, with relative risk reductions ranging from 41% to 65%.
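The percentages above can be read as simple ratios. The sketch below assumes that a replication score is the share of items reproduced on an anonymized dataset, and that relative risk reduction follows the usual definition (baseline risk minus anonymized risk, divided by baseline risk). The abstract does not spell out either formula, so both functions and all input values are hypothetical.

```python
def replication_score(replicated):
    """Share of items flagged as replicated (list of booleans)."""
    if not replicated:
        raise ValueError("at least one item is required")
    return sum(replicated) / len(replicated)


def relative_risk_reduction(risk_raw, risk_anonymized):
    """(risk_raw - risk_anonymized) / risk_raw, assuming risk_raw > 0."""
    if risk_raw <= 0:
        raise ValueError("baseline risk must be positive")
    return (risk_raw - risk_anonymized) / risk_raw


# Hypothetical inputs for illustration only, not values from the study.
flags = [True, True, False, True]
print(f"replication score: {replication_score(flags):.0%}")
print(f"relative risk reduction: {relative_risk_reduction(0.20, 0.10):.0%}")
```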

Conclusion: None of the anonymization methods reproduced all outputs and results. A trade-off has to be found between context risk and the usefulness of the data for answering the research question.
