DIRI: Adversarial Patient Reidentification with Large Language Models for Evaluating Clinical Text Anonymization.

AMIA Joint Summits on Translational Science proceedings. AMIA Joint Summits on Translational Science. Pub Date: 2025-06-10. eCollection Date: 2025-01-01.

John X Morris, Thomas R Campion, Sri Laasya Nutheti, Yifan Peng, Akhil Raj, Ramin Zabih, Curtis L Cole

Volume 2025, pages 355-364. Publication type: Journal Article. Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12150728/pdf/

Citations: 0

Abstract

DIRI: Adversarial Patient Reidentification with Large Language Models for Evaluating Clinical Text Anonymization.

Sharing protected health information (PHI) is critical for furthering biomedical research. Before data can be distributed, practitioners often perform deidentification to remove any PHI contained in the text. Contemporary deidentification methods are evaluated on highly saturated datasets, on which tools achieve near-perfect accuracy, and these datasets may not reflect the full variability or complexity of real-world clinical text. Annotating such datasets is also resource intensive, which is a barrier to real-world applications. To address this gap, we developed an adversarial approach that uses a large language model (LLM) to reidentify the patient corresponding to a redacted clinical note, and we evaluated performance with a novel De-Identification/Re-Identification (DIRI) method. We demonstrate our method on medical data from Weill Cornell Medicine anonymized with three deidentification tools: the rule-based Philter and two deep-learning-based models, BiLSTM-CRF and ClinicalBERT. Although ClinicalBERT was the most effective, masking all identified PII, our tool still reidentified 9% of clinical notes. Our study highlights significant weaknesses in current deidentification technologies while providing a tool for iterative development and improvement.
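The abstract reports its headline result as the share of redacted notes that an adversary links back to the correct patient. As a rough illustration of how such a reidentification rate could be computed, here is a minimal Python sketch. It is not the authors' implementation: the function names, the patient-profile dictionary, and the scoring interface are all hypothetical, and the scorer is a simple token-overlap stand-in rather than the LLM the paper uses.

```python
from typing import Callable, Mapping, Sequence


def token_overlap(redacted_note: str, profile: str) -> float:
    """Stand-in scorer (NOT the paper's LLM): Jaccard overlap of lowercase tokens."""
    note_tokens = set(redacted_note.lower().split())
    profile_tokens = set(profile.lower().split())
    union = note_tokens | profile_tokens
    return len(note_tokens & profile_tokens) / len(union) if union else 0.0


def reidentification_rate(
    redacted_notes: Sequence[str],
    true_patient_ids: Sequence[str],
    profiles: Mapping[str, str],
    score: Callable[[str, str], float] = token_overlap,
) -> float:
    """Fraction of redacted notes whose top-scoring candidate is the true patient.

    `profiles` maps a hypothetical patient ID to whatever side information the
    adversary is assumed to hold about that patient.
    """
    if len(redacted_notes) != len(true_patient_ids):
        raise ValueError("each note needs exactly one true patient ID")
    if not redacted_notes:
        return 0.0
    hits = 0
    for note, true_id in zip(redacted_notes, true_patient_ids):
        # The adversary guesses the candidate whose profile best matches the note.
        guess = max(profiles, key=lambda pid: score(note, profiles[pid]))
        hits += guess == true_id
    return hits / len(redacted_notes)
```

Under these assumptions, a lower rate means the deidentification tool left fewer linkable clues in the text. Passing a different `score` callable (for example, one backed by a language model) changes the adversary without changing the metric.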

Source journal

AMIA Joint Summits on Translational Science proceedings. AMIA Joint Summits on Translational Science

Self-citation rate

0.00%

Number of articles published