Re-identification risk for common privacy preserving patient matching strategies when shared with de-identified demographics.

IF 4.6 | Zone 2, Medicine | Q1 Computer Science, Information Systems
Austin Eliazar, James Thomas Brown, Sara Cinamon, Murat Kantarcioglu, Bradley Malin
Journal of the American Medical Informatics Association. Published 2025-10-17. DOI: 10.1093/jamia/ocaf183 (https://doi.org/10.1093/jamia/ocaf183)

Abstract

Objective: Privacy preserving record linkage (PPRL) refers to techniques used to identify which records refer to the same person across disparate datasets while safeguarding their identities. PPRL is increasingly relied upon to facilitate biomedical research. A common strategy encodes personally identifying information for comparison without disclosing underlying identifiers. As the scale of research datasets expands, it becomes crucial to reassess the privacy risks associated with these encodings. This paper highlights the potential re-identification risks of some of these encodings, demonstrating an attack that exploits encoding repetition across patients.
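
To make the encoding idea concrete, here is a minimal sketch of one widely used style of PPRL token: a keyed hash over normalized identifiers. The abstract does not say which encodings were evaluated, so the field choices, the `normalize` and `make_token` helpers, and the key below are all illustrative assumptions rather than the authors' method.

```python
import hashlib
import hmac


def normalize(value: str) -> str:
    """Uppercase and keep only letters and digits, so formatting noise does not change the token."""
    return "".join(ch for ch in value.upper() if ch.isalnum())


def make_token(secret_key: bytes, *fields: str) -> str:
    """Return a hex HMAC-SHA256 digest over the normalized fields (hypothetical token scheme)."""
    message = "|".join(normalize(f) for f in fields).encode("utf-8")
    return hmac.new(secret_key, message, hashlib.sha256).hexdigest()


KEY = b"hypothetical-shared-secret"  # assumption: all contributing sites share one key

# The same person recorded with different formatting at two sites yields the same token.
site_a = make_token(KEY, "Maria", "O'Neil", "1980-02-14")
site_b = make_token(KEY, "maria ", "ONEIL", "1980-02-14")

# A token built from coarse fields repeats across different patients.
patient_1 = make_token(KEY, "Smith", "1980")  # e.g. John Smith, born 1980
patient_2 = make_token(KEY, "Smith", "1980")  # e.g. Jane Smith, born 1980

print(site_a == site_b)        # tokens agree across sites
print(patient_1 == patient_2)  # distinct patients, identical token
```

The second pair shows the property the paper focuses on: when a token is derived from only a few fields, distinct patients can share the same value, and that repetition is what the attack exploits.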

Materials and methods: The attack leverages repeated PPRL encoding values combined with common demographics shared during PPRL in the clear (e.g., 3-digit ZIP code) to distinguish encodings from one another and ultimately link them to identities in a reference dataset. Using US Census statistics and voter registries, we empirically estimate encodings' re-identification risk against such an attack, while varying multiple factors that influence the risk.
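
The general shape of such an attack can be sketched on toy data. This is a deliberately simplified illustration, not the authors' algorithm: it assumes each token is derived from the surname only, that the released records cover exactly the same people as the reference list, and that matching on exact per-ZIP counts is sufficient. A realistic attack would have to be statistical. All names, tokens, and function names are made up.

```python
from collections import Counter, defaultdict


def signature(pairs):
    """Map each key to a sorted tuple of (zip3, count) describing where it occurs."""
    counts = defaultdict(Counter)
    for key, zip3 in pairs:
        counts[key][zip3] += 1
    return {key: tuple(sorted(c.items())) for key, c in counts.items()}


def link_tokens(released, reference):
    """Link an opaque token to a surname when exactly one surname has the same geographic signature."""
    token_sig = signature(released)
    name_sig = signature((surname, zip3) for _, surname, zip3 in reference)
    by_sig = defaultdict(list)
    for surname, sig in name_sig.items():
        by_sig[sig].append(surname)
    return {
        token: by_sig[sig][0]
        for token, sig in token_sig.items()
        if len(by_sig.get(sig, [])) == 1
    }


def reidentify(released, reference):
    """Return {record index: person} for records whose (surname, zip3) matches one reference person."""
    mapping = link_tokens(released, reference)
    people = defaultdict(list)
    for person, surname, zip3 in reference:
        people[(surname, zip3)].append(person)
    hits = {}
    for i, (token, zip3) in enumerate(released):
        candidates = people.get((mapping.get(token), zip3), [])
        if len(candidates) == 1:
            hits[i] = candidates[0]
    return hits


# Hypothetical reference data (e.g. a voter list): (person, surname, 3-digit ZIP).
reference = [
    ("Ana Smith", "SMITH", "372"), ("Bo Smith", "SMITH", "372"), ("Cy Smith", "SMITH", "100"),
    ("Di Jones", "JONES", "372"), ("Ed Jones", "JONES", "100"), ("Flo Jones", "JONES", "100"),
    ("Gus Lee", "LEE", "752"),
]
# Released de-identified records: (opaque token, 3-digit ZIP shared in the clear).
released = [
    ("tok_a", "372"), ("tok_a", "372"), ("tok_a", "100"),
    ("tok_b", "372"), ("tok_b", "100"), ("tok_b", "100"),
    ("tok_c", "752"),
]

hits = reidentify(released, reference)
print(hits)  # {2: 'Cy Smith', 3: 'Di Jones', 6: 'Gus Lee'}
```

In this toy case the attacker never inverts the hash. The repetition pattern of each token across ZIP areas is enough to tell tokens apart, and the clear-text ZIP then narrows three of the seven records to a single person.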

Results: Re-identification risk for PPRL encodings increases with population size, number of distinct encodings per patient, and amount of demographic information available. Commonly used encodings typically grow from <1% re-identification rate for datasets under one million individuals to 10%-20% for 250 million individuals.

Discussion and conclusion: Re-identification risk often remains low in smaller populations, but increases significantly at the larger scales increasingly encountered today. These risks are common in many PPRL implementations, although, as our work shows, they are avoidable. Choosing better tokens or matching tokens through a third party without the underlying demographics effectively eliminates these risks.

Source journal
Journal of the American Medical Informatics Association (Medicine; Computer Science, Interdisciplinary Applications)
CiteScore: 14.50
Self-citation rate: 7.80%
Articles published: 230
Review time: 3-8 weeks
About the journal: JAMIA is AMIA's premier peer-reviewed journal for biomedical and health informatics. Covering the full spectrum of activities in the field, JAMIA includes informatics articles in the areas of clinical care, clinical research, translational science, implementation science, imaging, education, consumer health, public health, and policy. JAMIA's articles describe innovative informatics research and systems that help to advance biomedical science and to promote health. Case reports, perspectives, and reviews also help readers stay connected with the most important informatics developments in implementation, policy, and education.