Leveraging Large Language Models for Accurate Retrieval of Patient Information From Medical Reports: Systematic Evaluation Study.

JMIR AI | Pub Date: 2025-07-03 | DOI: 10.2196/68776
Angel Manuel Garcia-Carmona, Maria-Lorena Prieto, Enrique Puertas, Juan-Jose Beunza
{"title":"Leveraging Large Language Models for Accurate Retrieval of Patient Information From Medical Reports: Systematic Evaluation Study.","authors":"Angel Manuel Garcia-Carmona, Maria-Lorena Prieto, Enrique Puertas, Juan-Jose Beunza","doi":"10.2196/68776","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>The digital transformation of health care has introduced both opportunities and challenges, particularly in managing and analyzing the vast amounts of unstructured medical data generated daily. There is a need to explore the feasibility of generative solutions in extracting data from medical reports, categorized by specific criteria.</p><p><strong>Objective: </strong>This study aimed to investigate the application of large language models (LLMs) for the automated extraction of structured information from unstructured medical reports, using the LangChain framework in Python.</p><p><strong>Methods: </strong>Through a systematic evaluation of leading LLMs-GPT-4o, Llama 3, Llama 3.1, Gemma 2, Qwen 2, and Qwen 2.5-using zero-shot prompting techniques and embedding results into a vector database, this study assessed the performance of LLMs in extracting patient demographics, diagnostic details, and pharmacological data.</p><p><strong>Results: </strong>Evaluation metrics, including accuracy, precision, recall, and F<sub>1</sub>-score, revealed high efficacy across most categories, with GPT-4o achieving the highest overall performance (91.4% accuracy).</p><p><strong>Conclusions: </strong>The findings highlight notable differences in precision and recall between models, particularly in extracting names and age-related information. There were challenges in processing unstructured medical text, including variability in model performance across data types. Our findings demonstrate the feasibility of integrating LLMs into health care workflows; LLMs offer substantial improvements in data accessibility and support clinical decision-making processes. In addition, the paper describes the role of retrieval-augmented generation techniques in enhancing information retrieval accuracy, addressing issues such as hallucinations and outdated data in LLM outputs. Future work should explore the need for optimization through larger and more diverse training datasets, advanced prompting strategies, and the integration of domain-specific knowledge to improve model generalizability and precision.</p>","PeriodicalId":73551,"journal":{"name":"JMIR AI","volume":"4 ","pages":"e68776"},"PeriodicalIF":0.0000,"publicationDate":"2025-07-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"JMIR AI","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.2196/68776","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Background: The digital transformation of health care has introduced both opportunities and challenges, particularly in managing and analyzing the vast amounts of unstructured medical data generated daily. There is a need to explore the feasibility of generative solutions for extracting data from medical reports, categorized by specific criteria.

Objective: This study aimed to investigate the application of large language models (LLMs) for the automated extraction of structured information from unstructured medical reports, using the LangChain framework in Python.
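
To make the objective concrete, the following is a minimal sketch of what a LangChain-based structured extraction step could look like in Python. The Patient schema fields, prompt wording, and example report are illustrative assumptions, not the authors' exact implementation.

```python
# Hypothetical sketch: structured extraction of patient fields with LangChain.
# The schema, prompt, and sample report are assumptions for illustration.
from typing import Optional

from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from pydantic import BaseModel, Field


class Patient(BaseModel):
    """Fields of interest: demographics, diagnostic details, and medications."""
    name: Optional[str] = Field(None, description="Patient full name")
    age: Optional[int] = Field(None, description="Patient age in years")
    diagnosis: Optional[str] = Field(None, description="Primary diagnosis")
    medications: list[str] = Field(default_factory=list, description="Prescribed drugs")


prompt = ChatPromptTemplate.from_messages([
    ("system", "Extract the requested patient information from the medical report. "
               "Return null for any field that is not present."),
    ("human", "{report}"),
])

llm = ChatOpenAI(model="gpt-4o", temperature=0)
extractor = prompt | llm.with_structured_output(Patient)

record = extractor.invoke(
    {"report": "72-year-old male admitted with type 2 diabetes; started on metformin 850 mg."}
)
print(record.model_dump())
```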

Methods: Through a systematic evaluation of leading LLMs (GPT-4o, Llama 3, Llama 3.1, Gemma 2, Qwen 2, and Qwen 2.5) using zero-shot prompting techniques and embedding results into a vector database, this study assessed the performance of LLMs in extracting patient demographics, diagnostic details, and pharmacological data.
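
As a rough sketch of how the zero-shot prompting and vector-database steps might be wired together: the use of ChatOllama for the open-weight models, the FAISS store, and OpenAI embeddings are illustrative assumptions, since the abstract does not specify how the models were served or which vector database was used.

```python
# Hypothetical sketch: zero-shot extraction across several models, with the raw
# outputs embedded into a vector store. Serving and storage choices are assumed.
from langchain_community.vectorstores import FAISS
from langchain_ollama import ChatOllama
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# Candidate models from the study; deployment details are assumptions.
models = {
    "gpt-4o": ChatOpenAI(model="gpt-4o", temperature=0),
    "llama3.1": ChatOllama(model="llama3.1", temperature=0),
    "qwen2.5": ChatOllama(model="qwen2.5", temperature=0),
}

zero_shot_prompt = (
    "Extract the patient's name, age, diagnosis, and medications from the "
    "following report. Answer in JSON only.\n\nReport:\n{report}"
)

report = "72-year-old male admitted with type 2 diabetes; started on metformin 850 mg."
outputs = {name: llm.invoke(zero_shot_prompt.format(report=report)).content
           for name, llm in models.items()}

# Embed each model's extraction into a vector database for later retrieval.
store = FAISS.from_texts(
    texts=list(outputs.values()),
    embedding=OpenAIEmbeddings(),
    metadatas=[{"model": name} for name in outputs],
)
hits = store.similarity_search("type 2 diabetes medication", k=2)
```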

Results: Evaluation metrics, including accuracy, precision, recall, and F1-score, revealed high efficacy across most categories, with GPT-4o achieving the highest overall performance (91.4% accuracy).
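
For reference, these metrics can be computed per extraction field by comparing model outputs against a gold annotation. The sketch below uses scikit-learn; the binary labelling scheme (positive when the gold standard contains a value, predicted positive when the model returns the correct value) is an assumption, and the paper may score fields differently.

```python
# Minimal sketch: scoring one extraction field (e.g., age) against a gold standard.
# The labelling scheme and toy data are assumptions for illustration only.
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

gold      = ["72", "65", None, "54", "81"]   # annotated age per report (None = absent)
predicted = ["72", "65", None, "48", "81"]   # value extracted by the model

y_true = [int(g is not None) for g in gold]
y_pred = [int(p is not None and p == g) for p, g in zip(predicted, gold)]

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred, zero_division=0))
print("recall   :", recall_score(y_true, y_pred, zero_division=0))
print("F1       :", f1_score(y_true, y_pred, zero_division=0))
```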

Conclusions: The findings highlight notable differences in precision and recall between models, particularly in extracting names and age-related information. There were challenges in processing unstructured medical text, including variability in model performance across data types. Our findings demonstrate the feasibility of integrating LLMs into health care workflows; LLMs offer substantial improvements in data accessibility and support clinical decision-making processes. In addition, the paper describes the role of retrieval-augmented generation techniques in enhancing information retrieval accuracy, addressing issues such as hallucinations and outdated data in LLM outputs. Future work should pursue optimization through larger and more diverse training datasets, advanced prompting strategies, and the integration of domain-specific knowledge to improve model generalizability and precision.
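
The retrieval-augmented generation pattern referred to in the conclusions can be illustrated with a minimal sketch in which the model answers only from retrieved report passages, which helps mitigate hallucination. The chunking parameters, FAISS store, prompt wording, and sample reports are assumptions, not the authors' configuration.

```python
# Hypothetical sketch of retrieval-augmented generation over medical reports.
# Chunking settings, vector store, and prompt are illustrative assumptions.
from langchain_community.vectorstores import FAISS
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

reports = [
    "Discharge summary: 72-year-old male, type 2 diabetes, started on metformin 850 mg.",
    "Progress note: hypertension controlled with enalapril 10 mg daily.",
]

# Index the report corpus.
chunks = RecursiveCharacterTextSplitter(chunk_size=300, chunk_overlap=50).create_documents(reports)
retriever = FAISS.from_documents(chunks, OpenAIEmbeddings()).as_retriever(search_kwargs={"k": 2})

prompt = ChatPromptTemplate.from_messages([
    ("system", "Answer strictly from the context below. If the answer is not in the "
               "context, say you do not know.\n\nContext:\n{context}"),
    ("human", "{question}"),
])
llm = ChatOpenAI(model="gpt-4o", temperature=0)

question = "Which medication was prescribed for the diabetic patient?"
context = "\n\n".join(d.page_content for d in retriever.invoke(question))
answer = (prompt | llm).invoke({"context": context, "question": question})
print(answer.content)
```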
