MedPromptExtract (Medical Data Extraction Tool): Anonymization and High-Fidelity Automated Data Extraction Using Natural Language Processing and Prompt Engineering.
Authors: Roomani Srivastava, Lipika Bhat, Suraj Prasad, Sarvesh Deshpande, Barnali Das, Kshitij Jadhav
Journal of Applied Laboratory Medicine. Published 2025-03-29. DOI: 10.1093/jalm/jfaf034 (https://doi.org/10.1093/jalm/jfaf034)
Abstract
Background: The labor-intensive nature of data extraction from sources like discharge summaries (DSs) poses a significant obstacle to the digitization of medical records, particularly in low- and middle-income countries (LMICs). In this paper we present a completely automated method, MedPromptExtract, that efficiently extracts data from DSs while maintaining confidentiality.
Methods: The data source was DSs of patients with acute kidney injury (AKI) from Kokilaben Dhirubhai Ambani Hospital (KDAH). A pre-existing tool, Expert-Informed Joint Learning aGgrEgatioN (EIGEN), which leverages semi-supervised learning techniques for high-fidelity information extraction, was used to anonymize the DSs, and natural language processing (NLP) was used to extract data from regular fields. We used prompt engineering and a large language model (LLM) to extract custom clinical information from the free-flowing text describing the patient's stay in the hospital. Twelve features associated with the occurrence of AKI were extracted. The LLM's responses were validated against clinicians' annotations.
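The prompt-based extraction step described in Methods can be illustrated with a minimal sketch. It is not the authors' prompt or code: the feature names, the prompt wording, and the helper names `build_prompt` and `parse_response` are all hypothetical, and the abstract names neither the twelve features nor the response format. The sketch assumes the model is asked for one 0/1 answer per feature as JSON, and it leaves out the actual LLM call.

```python
import json

# Hypothetical feature names; the abstract does not list the twelve features.
FEATURES = ["sepsis", "hypotension", "nephrotoxic_drug_use"]


def build_prompt(summary_text, features):
    """Assemble one prompt asking for a 0/1 answer per feature as JSON."""
    feature_lines = "\n".join(f"- {name}" for name in features)
    return (
        "Read the anonymized discharge summary below.\n"
        "For each feature, answer 1 if it is documented and 0 otherwise.\n"
        "Reply with a single JSON object keyed by feature name.\n\n"
        f"Features:\n{feature_lines}\n\n"
        f"Summary:\n{summary_text}\n"
    )


def parse_response(raw, features):
    """Parse the model reply; missing or malformed answers become None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        data = None
    if not isinstance(data, dict):
        return {name: None for name in features}
    result = {}
    for name in features:
        value = data.get(name)
        # Accept only 0 or 1; anything else is flagged for manual review.
        result[name] = int(value) if value in (0, 1) else None
    return result
```

Returning `None` for unparseable or missing answers, rather than defaulting to 0, keeps such cases visible when the responses are later compared with clinicians' annotations.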
Results: MedPromptExtract first passed each DS through the anonymization pipeline, which took 3 s per summary. After clinicians verified successful anonymization, the NLP pipeline extracted structured text from the anonymized PDFs at a rate of 0.2 s per summary with 100% accuracy. Finally, the LLM pipeline, using Gemini Pro, analysed the DSs for the 12 features. Accuracy metrics were calculated by comparing model responses with clinicians' annotations; 7 features achieved an area under the curve (AUC) above 0.9, indicating the high fidelity of the extraction process.
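For readers unfamiliar with the metric, the per-feature AUC mentioned in Results can be computed from paired annotations and model outputs as sketched below. This is an illustrative calculation under stated assumptions, not the authors' evaluation code: the function name `roc_auc` and the example data are made up, and the abstract does not say whether the model outputs were binary answers or scores. The sketch uses the pairwise (Mann-Whitney) definition of AUC, in which ties count as half.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    labels: clinician annotations, 1 = feature present, 0 = absent.
    scores: model outputs for the same summaries (binary answers or scores).
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(pos) * len(neg))


# Made-up example for one feature across five summaries.
clinician = [1, 0, 1, 0, 1]
model = [1, 0, 1, 1, 1]
print(roc_auc(clinician, model))  # 0.75
```

With binary model answers, this value equals the mean of sensitivity and specificity; in the example, sensitivity is 1.0 and specificity is 0.5.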
Conclusions: MedPromptExtract serves as an automated, adaptable tool with a dynamic user interface for efficient data extraction from medical records.