MedPromptExtract (Medical Data Extraction Tool): Anonymization and High-Fidelity Automated Data Extraction Using Natural Language Processing and Prompt Engineering.

Impact factor: 1.8 (JCR Q3, Medical Laboratory Technology)

Journal of Applied Laboratory Medicine; Pub Date: 2025-03-29; DOI: 10.1093/jalm/jfaf034

Roomani Srivastava, Lipika Bhat, Suraj Prasad, Sarvesh Deshpande, Barnali Das, Kshitij Jadhav



Abstract

Background: The labor-intensive nature of data extraction from sources like discharge summaries (DSs) poses a significant obstacle to the digitization of medical records, particularly in low- and middle-income countries (LMICs). In this paper we present a fully automated method, MedPromptExtract, to efficiently extract data from DSs while maintaining confidentiality.

Methods: The data source was DSs of patients with acute kidney injury (AKI) from Kokilaben Dhirubhai Ambani Hospital (KDAH). A pre-existing tool, Expert-Informed Joint Learning aGgrEatioN (EIGEN), which leverages semi-supervised learning techniques for high-fidelity information extraction, was used to anonymize the DSs, and natural language processing (NLP) was used to extract data from regular fields. We used prompt engineering and a large language model (LLM) to extract custom clinical information from the free-flowing text describing the patient's stay in the hospital. Twelve features associated with the occurrence of AKI were extracted. The LLM's responses were validated against clinicians' annotations.
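The two extraction steps described in the Methods can be illustrated with a minimal Python sketch. It is not the authors' implementation: the field labels, the regular expressions, the prompt wording, and the function names (`extract_regular_fields`, `build_feature_prompt`, `parse_yes_no`) are all hypothetical, since the abstract does not list the regular fields, the 12 features, or the prompts. The sketch also stops short of calling any LLM API and only shows how a prompt could be built and a one-word reply parsed.

```python
import re

# Hypothetical field labels; the abstract does not name the actual regular fields.
FIELD_PATTERNS = {
    "age": re.compile(r"Age\s*:\s*(\d+)"),
    "sex": re.compile(r"Sex\s*:\s*(\w+)"),
    "admission_date": re.compile(r"Date of Admission\s*:\s*([\d/-]+)"),
}


def extract_regular_fields(text):
    """Pull labelled, fixed-format fields out of anonymized summary text."""
    fields = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        fields[name] = match.group(1) if match else None
    return fields


def build_feature_prompt(feature_question, course_text):
    """Build a prompt asking for a single yes/no clinical feature (assumed wording)."""
    return (
        "You are reviewing an anonymized hospital discharge summary.\n"
        f"Question: {feature_question}\n"
        "Answer with exactly one word: Yes or No.\n\n"
        f"Hospital course:\n{course_text}"
    )


def parse_yes_no(response):
    """Map a model reply to 1 (yes), 0 (no), or None if it is neither."""
    match = re.match(r"\W*(yes|no)\b", response, re.IGNORECASE)
    if match is None:
        return None
    return 1 if match.group(1).lower() == "yes" else 0
```

Returning `None` for an unparseable reply keeps ambiguous model output out of the accuracy calculation instead of silently counting it as a "no".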

Results: The MedPromptExtract tool first subjected DSs to the anonymization pipeline, which took 3 s per summary. After clinicians verified that anonymization was successful, the NLP pipeline extracted structured text from the anonymized PDFs at a rate of 0.2 s per summary with 100% accuracy. Finally, DSs were analysed by the LLM pipeline using Gemini Pro for the 12 features. Accuracy metrics were calculated by comparing model responses to clinicians' annotations; 7 features achieved an Area Under the Curve (AUC) above 0.9, indicating the high fidelity of the extraction process.
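As an illustration of the validation step, the sketch below computes a per-feature AUC from clinician annotations and model responses using the pairwise (Mann-Whitney) definition. The abstract does not say how the AUC was calculated or whether model responses were binary answers or scores, so the function name `roc_auc` and the example labels are assumptions made only for this sketch.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a positive case outranks a negative one.

    labels: clinician annotations (1 = feature present, 0 = absent).
    scores: model outputs; binary 0/1 answers or continuous scores both work.
    Ties between a positive and a negative case count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Made-up example: one feature annotated across six summaries.
clinician = [1, 1, 0, 0, 1, 0]
model = [1, 1, 0, 0, 0, 0]
auc = roc_auc(clinician, model)
```

When the model output is a binary yes/no answer, this value equals the mean of sensitivity and specificity; in the made-up example above that is (2/3 + 1) / 2 = 5/6.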

Conclusions: MedPromptExtract serves as an automated, adaptable tool with a dynamic user interface for efficient data extraction from medical records.




Source journal: Journal of Applied Laboratory Medicine (Medical Laboratory Technology)

CiteScore: 3.70

Self-citation rate: 5.00%

Articles published: 137