当与去识别的人口统计数据共享时，共同隐私保护患者匹配策略的重新识别风险。

IF 4.6 2区医学 Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS

Journal of the American Medical Informatics Association Pub Date : 2025-10-17 DOI:10.1093/jamia/ocaf183

Austin Eliazar, James Thomas Brown, Sara Cinamon, Murat Kantarcioglu, Bradley Malin

{"title":"当与去识别的人口统计数据共享时，共同隐私保护患者匹配策略的重新识别风险。","authors":"Austin Eliazar, James Thomas Brown, Sara Cinamon, Murat Kantarcioglu, Bradley Malin","doi":"10.1093/jamia/ocaf183","DOIUrl":null,"url":null,"abstract":"Objective: Privacy preserving record linkage (PPRL) refers to techniques used to identify which records refer to the same person across disparate datasets while safeguarding their identities. PPRL is increasingly relied upon to facilitate biomedical research. A common strategy encodes personally identifying information for comparison without disclosing underlying identifiers. As the scale of research datasets expands, it becomes crucial to reassess the privacy risks associated with these encodings. This paper highlights the potential re-identification risks of some of these encodings, demonstrating an attack that exploits encoding repetition across patients.Materials and methods: The attack leverages repeated PPRL encoding values combined with common demographics shared during PPRL in the clear (e.g., 3-digit ZIP code) to distinguish encodings from one another and ultimately link them to identities in a reference dataset. Using US Census statistics and voter registries, we empirically estimate encodings' re-identification risk against such an attack, while varying multiple factors that influence the risk.Results: Re-identification risk for PPRL encodings increases with population size, number of distinct encodings per patient, and amount of demographic information available. Commonly used encodings typically grow from <1% re-identification rate for datasets under one million individuals to 10%-20% for 250 million individuals.Discussion and conclusion: Re-identification risk often remains low in smaller populations, but increases significantly at the larger scales increasingly encountered today. These risks are common in many PPRL implementations, although, as our work shows, they are avoidable. Choosing better tokens or matching tokens through a third party without the underlying demographics effectively eliminates these risks.","PeriodicalId":50016,"journal":{"name":"Journal of the American Medical Informatics Association","volume":" ","pages":""},"PeriodicalIF":4.6000,"publicationDate":"2025-10-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Re-identification risk for common privacy preserving patient matching strategies when shared with de-identified demographics.\",\"authors\":\"Austin Eliazar, James Thomas Brown, Sara Cinamon, Murat Kantarcioglu, Bradley Malin\",\"doi\":\"10.1093/jamia/ocaf183\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Objective: Privacy preserving record linkage (PPRL) refers to techniques used to identify which records refer to the same person across disparate datasets while safeguarding their identities. PPRL is increasingly relied upon to facilitate biomedical research. A common strategy encodes personally identifying information for comparison without disclosing underlying identifiers. As the scale of research datasets expands, it becomes crucial to reassess the privacy risks associated with these encodings. This paper highlights the potential re-identification risks of some of these encodings, demonstrating an attack that exploits encoding repetition across patients.Materials and methods: The attack leverages repeated PPRL encoding values combined with common demographics shared during PPRL in the clear (e.g., 3-digit ZIP code) to distinguish encodings from one another and ultimately link them to identities in a reference dataset. Using US Census statistics and voter registries, we empirically estimate encodings' re-identification risk against such an attack, while varying multiple factors that influence the risk.Results: Re-identification risk for PPRL encodings increases with population size, number of distinct encodings per patient, and amount of demographic information available. Commonly used encodings typically grow from <1% re-identification rate for datasets under one million individuals to 10%-20% for 250 million individuals.Discussion and conclusion: Re-identification risk often remains low in smaller populations, but increases significantly at the larger scales increasingly encountered today. These risks are common in many PPRL implementations, although, as our work shows, they are avoidable. Choosing better tokens or matching tokens through a third party without the underlying demographics effectively eliminates these risks.\",\"PeriodicalId\":50016,\"journal\":{\"name\":\"Journal of the American Medical Informatics Association\",\"volume\":\" \",\"pages\":\"\"},\"PeriodicalIF\":4.6000,\"publicationDate\":\"2025-10-17\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Journal of the American Medical Informatics Association\",\"FirstCategoryId\":\"91\",\"ListUrlMain\":\"https://doi.org/10.1093/jamia/ocaf183\",\"RegionNum\":2,\"RegionCategory\":\"医学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, INFORMATION SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of the American Medical Informatics Association","FirstCategoryId":"91","ListUrlMain":"https://doi.org/10.1093/jamia/ocaf183","RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}

引用次数: 0

摘要

目的：隐私保护记录链接（PPRL）是指用于识别哪些记录涉及不同数据集中的同一个人，同时保护其身份的技术。PPRL越来越多地用于促进生物医学研究。一种常见的策略是对个人标识信息进行编码，以便在不泄露底层标识符的情况下进行比较。随着研究数据集规模的扩大，重新评估与这些编码相关的隐私风险变得至关重要。本文强调了其中一些编码的潜在重新识别风险，展示了一种利用患者之间编码重复的攻击。材料和方法：攻击利用重复的PPRL编码值与PPRL期间共享的公共人口统计数据（例如，3位数的邮政编码）来区分编码，并最终将它们链接到参考数据集中的身份。使用美国人口普查统计数据和选民登记，我们在改变影响风险的多个因素的同时，对这种攻击的编码重新识别风险进行了经验估计。结果：PPRL编码的再识别风险随着人群规模、每位患者不同编码的数量和可获得的人口统计信息的数量而增加。常用的编码通常是从讨论和结论中得出的：在较小的人群中，重新识别的风险通常仍然很低，但在今天日益遇到的更大范围中，风险会显著增加。这些风险在许多PPRL实现中是常见的，尽管，正如我们的工作所示，它们是可以避免的。选择更好的代币或通过第三方匹配代币，而不需要潜在的人口统计数据，有效地消除了这些风险。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Re-identification risk for common privacy preserving patient matching strategies when shared with de-identified demographics.

Objective: Privacy preserving record linkage (PPRL) refers to techniques used to identify which records refer to the same person across disparate datasets while safeguarding their identities. PPRL is increasingly relied upon to facilitate biomedical research. A common strategy encodes personally identifying information for comparison without disclosing underlying identifiers. As the scale of research datasets expands, it becomes crucial to reassess the privacy risks associated with these encodings. This paper highlights the potential re-identification risks of some of these encodings, demonstrating an attack that exploits encoding repetition across patients.

Materials and methods: The attack leverages repeated PPRL encoding values combined with common demographics shared during PPRL in the clear (e.g., 3-digit ZIP code) to distinguish encodings from one another and ultimately link them to identities in a reference dataset. Using US Census statistics and voter registries, we empirically estimate encodings' re-identification risk against such an attack, while varying multiple factors that influence the risk.

Results: Re-identification risk for PPRL encodings increases with population size, number of distinct encodings per patient, and amount of demographic information available. Commonly used encodings typically grow from <1% re-identification rate for datasets under one million individuals to 10%-20% for 250 million individuals.

Discussion and conclusion: Re-identification risk often remains low in smaller populations, but increases significantly at the larger scales increasingly encountered today. These risks are common in many PPRL implementations, although, as our work shows, they are avoidable. Choosing better tokens or matching tokens through a third party without the underlying demographics effectively eliminates these risks.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Journal of the American Medical Informatics Association 医学-计算机：跨学科应用

CiteScore

14.50

自引率

7.80%

发文量

230

审稿时长

3-8 weeks

期刊介绍： JAMIA is AMIA''s premier peer-reviewed journal for biomedical and health informatics. Covering the full spectrum of activities in the field, JAMIA includes informatics articles in the areas of clinical care, clinical research, translational science, implementation science, imaging, education, consumer health, public health, and policy. JAMIA''s articles describe innovative informatics research and systems that help to advance biomedical science and to promote health. Case reports, perspectives and reviews also help readers stay connected with the most important informatics developments in implementation, policy and education.