Evaluating the accuracy of automated and semi-automated anonymization tools for unstructured health records.

Surgical neurology international. Pub Date: 2025-08-01. eCollection Date: 2025-01-01. DOI: 10.25259/SNI_459_2025

Layan Abdulilah Alrazihi, Sayan Biswas, Joshi George

Surgical neurology international, volume 16, page 313. Publication type: Journal Article. Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12477974/pdf/

Citations: 0

Abstract

Background: Utilization of unstructured clinical text in research is limited by the presence of protected health identifiers (PHI) within the text. To maintain patient privacy, PHI must be de-identified. The use of anonymization tools such as Microsoft Presidio and Philter has been recognized as a potential solution to the challenges of manual de-identification. Therefore, the primary objective of this study is to evaluate the accuracy and feasibility of using Microsoft Presidio and Philter in de-identifying unstructured clinical text.

Methods: A sample of 200 neurosurgical documents, temporally distributed across 10 years, was extracted. The data were processed by Microsoft Presidio and Philter. Each document was manually screened for the ground truth which was used as a reference point to evaluate the accuracy of each tool. Data analysis was conducted using Python.
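The abstract does not say how tool output was scored against the manual ground truth. The sketch below shows one common way such a comparison could be done, assuming token-level scoring; the `Counts` class, the function names, and the example note are hypothetical and are not taken from the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Counts:
    tp: int  # PHI tokens the tool flagged
    fp: int  # non-PHI tokens the tool flagged
    fn: int  # PHI tokens the tool missed
    tn: int  # non-PHI tokens the tool left alone


def score_document(tokens, truth_idx, predicted_idx):
    """Compare flagged token positions against the manual annotation."""
    truth, predicted = set(truth_idx), set(predicted_idx)
    tp = len(truth & predicted)
    fp = len(predicted - truth)
    fn = len(truth - predicted)
    tn = len(tokens) - tp - fp - fn
    return Counts(tp, fp, fn, tn)


def precision(c):
    flagged = c.tp + c.fp
    return c.tp / flagged if flagged else 0.0


def recall(c):
    actual = c.tp + c.fn
    return c.tp / actual if actual else 0.0


def accuracy(c):
    total = c.tp + c.fp + c.fn + c.tn
    return (c.tp + c.tn) / total if total else 0.0


# Hypothetical note, invented for illustration only (not study data).
tokens = "Mr John Smith seen on 12 March for lumbar pain".split()
truth_idx = [1, 2, 5, 6]      # "John", "Smith", "12", "March"
predicted_idx = [1, 2, 5, 9]  # tool misses "March" and wrongly flags "pain"

counts = score_document(tokens, truth_idx, predicted_idx)
print(counts, precision(counts), recall(counts), accuracy(counts))
```

Accuracy counts true negatives, so on text where most tokens are not PHI it tends to sit well above precision; precision and recall are the more informative figures for a de-identification task.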

Results: A median of 8 PHI were manually de-identified per document. Both tools were individually capable of de-identifying a median of 6 PHI per document. Each tool de-identified PHI with an accuracy of 96%. Presidio demonstrated precision of 0.51 and a recall of 0.74, while Philter had precision and recall of 0.35 and 0.79, respectively.
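To put the two tools on a single scale, the reported precision and recall can be combined into an F1 score (their harmonic mean). F1 is not reported in the abstract; the values below are plain arithmetic on the figures quoted above, and the helper name `f1` is an illustrative choice.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the Results paragraph.
reported = {
    "Presidio": (0.51, 0.74),
    "Philter": (0.35, 0.79),
}

# Derived here, not taken from the paper.
derived_f1 = {tool: round(f1(p, r), 2) for tool, (p, r) in reported.items()}
print(derived_f1)
```

On these figures Philter recovers slightly more PHI (higher recall) but flags more non-PHI text (lower precision), which pulls its derived F1 below Presidio's.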

Conclusion: The performance of each tool supports their use in anonymizing unstructured clinical text. Formatting variations between texts limited the performance of both tools. To conclude, further research is required to optimize the tools' output and assess the reliability in de-identifying diverse and previously unseen clinical text, thus allowing the use of unstructured clinical text in medical research.
