Medication information extraction using local large language models

Impact Factor 4.5 · CAS Zone 2 (Medicine) · JCR Q2, Computer Science, Interdisciplinary Applications
Phillip Richter-Pechanski, Marvin Seiferling, Christina Kiriakou, Dominic M. Schwab, Nicolas A. Geis, Christoph Dieterich, Anette Frank
Journal of Biomedical Informatics, Volume 169, Article 104898. Published 2025-08-21. DOI: 10.1016/j.jbi.2025.104898. Available at: https://www.sciencedirect.com/science/article/pii/S1532046425001273
Citations: 0

Abstract

Objective

Medication information is crucial for clinical routine and research. However, a vast amount is stored in unstructured text, such as doctor’s letters, requiring manual extraction – a resource-intensive, error-prone task. Automating this process comes with significant constraints in a clinical setup, including the demand for clinical expertise, limited time-resources, restricted IT infrastructure, and the demand for transparent predictions. Recent advances in generative large language models (LLMs) and parameter-efficient fine-tuning methods show potential to address these challenges.
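Parameter-efficient fine-tuning typically means training small adapter matrices instead of all model weights. A minimal sketch using the Hugging Face `peft` library with LoRA, one common such method, is shown below; the hyperparameters and target modules are illustrative assumptions, not values taken from the paper.

```python
# Minimal LoRA configuration with Hugging Face `peft` (a common
# parameter-efficient fine-tuning method). Hyperparameters here are
# illustrative defaults, not the paper's actual settings.
from peft import LoraConfig, TaskType

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,   # generative extraction, e.g. a Llama model
    r=16,                           # rank of the low-rank update matrices
    lora_alpha=32,                  # scaling factor for the LoRA update
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
)
# A loaded base model would then be wrapped with:
# model = get_peft_model(base_model, lora_config)
```

Because only the small rank-`r` matrices are trained, such a setup fits the limited time and IT resources of a clinical environment far better than full fine-tuning.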

Methods

We evaluated local LLMs for end-to-end extraction of medication information, combining named entity recognition and relation extraction. We used format-restricting instructions and developed an innovative feedback pipeline to facilitate automated evaluation. We applied token-level Shapley values to visualize and quantify token contributions, to improve transparency of model predictions.
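The feedback pipeline described above can be sketched as a validate-and-retry loop: the model is asked for a fixed output format, and when the response fails to parse, the parser error is fed back into the next prompt. The sketch below is a minimal illustration of that idea; the JSON schema, prompt wording, and `generate` callable are assumptions, not the paper's exact format.

```python
import json

def extract_with_feedback(generate, note, max_retries=3):
    """Prompt a local LLM for medication information as JSON; on a
    malformed response, re-prompt with the validation error.

    `generate` is any callable mapping a prompt string to a completion
    string (e.g. a llama.cpp or transformers wrapper). The schema below
    is an illustrative assumption, not the paper's exact format.
    """
    schema_hint = (
        "Return ONLY a JSON object of the form "
        '{"drugs": [{"name": str, "dosage": str|null, "reason": str|null}]}'
    )
    prompt = f"{schema_hint}\nText: {note}"
    for _ in range(max_retries):
        raw = generate(prompt)
        try:
            result = json.loads(raw)
            if isinstance(result, dict) and isinstance(result.get("drugs"), list):
                return result  # structurally valid extraction
            error = "top-level object must contain a 'drugs' list"
        except json.JSONDecodeError as exc:
            error = str(exc)
        # feed the validation error back to the model and retry
        prompt = (
            f"{schema_hint}\nText: {note}\n"
            f"Your previous answer was invalid ({error}). Answer again."
        )
    return {"drugs": []}  # give up gracefully after max_retries
```

Restricting the output format this way is also what makes automated evaluation possible: valid JSON can be compared against gold annotations without manual inspection.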

Results

Two open-source LLMs – one general (Llama) and one domain-specific (OpenBioLLM) – were evaluated on the English n2c2 2018 corpus and the German CARDIO:DE corpus. OpenBioLLM frequently struggled with structured outputs and hallucinations. Fine-tuned Llama models achieved new state-of-the-art results, improving F1-score by up to 10 percentage points (pp.) for adverse drug events and 6 pp. for medication reasons on English data. On the German dataset, Llama established a new benchmark, outperforming traditional machine learning methods by up to 16 pp. micro average F1-score.

Conclusion

Our findings show that fine-tuned local open-source generative LLMs outperform SOTA methods for medication information extraction, delivering high performance with limited time and IT resources in a real-world clinical setup, and demonstrate their effectiveness on both English and German data. Applying Shapley values improved prediction transparency, supporting informed clinical decision-making.
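Token-level Shapley values attribute a prediction score to each input token by averaging that token's marginal contribution over all subsets of the other tokens. The sketch below computes exact Shapley values by subset enumeration; the toy value function is an assumption for illustration (real pipelines approximate this by sampling, since exact computation is exponential in the number of tokens).

```python
from itertools import combinations
from math import factorial

def shapley_values(tokens, value_fn):
    """Exact token-level Shapley values by subset enumeration.

    `value_fn` maps a frozenset of kept token indices to a model score.
    Exponential in len(tokens); sampling-based approximations are used
    in practice, but the definition is the same.
    """
    n = len(tokens)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):  # subset sizes 0 .. n-1
            for subset in combinations(others, k):
                # classic Shapley weight: |S|! (n - |S| - 1)! / n!
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                gain = value_fn(frozenset(subset) | {i}) - value_fn(frozenset(subset))
                phi[i] += weight * gain
    return phi

# Toy value function (an assumption): the "model" scores 1.0 whenever
# the token "aspirin" (index 1) is present, else 0.0.
tokens = ["took", "aspirin", "daily"]
score = lambda kept: 1.0 if 1 in kept else 0.0
attributions = shapley_values(tokens, score)
```

By the efficiency property, the attributions sum to the difference between the full-input score and the empty-input score, so here all credit goes to "aspirin"; highlighting such attributions over the source text is what makes the extraction transparent to a clinician.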


Source journal: Journal of Biomedical Informatics (Medicine / Computer Science: Interdisciplinary Applications)
CiteScore: 8.90
Self-citation rate: 6.70%
Annual articles: 243
Review time: 32 days
Journal description: The Journal of Biomedical Informatics reflects a commitment to high-quality original research papers, reviews, and commentaries in the area of biomedical informatics methodology. Although we publish articles motivated by applications in the biomedical sciences (for example, clinical medicine, health care, population health, and translational bioinformatics), the journal emphasizes reports of new methodologies and techniques that have general applicability and that form the basis for the evolving science of biomedical informatics. Articles on medical devices; evaluations of implemented systems (including clinical trials of information technologies); or papers that provide insight into a biological process, a specific disease, or treatment options would generally be more suitable for publication in other venues. Papers on applications of signal processing and image analysis are often more suitable for biomedical engineering journals or other informatics journals, although we do publish papers that emphasize the information management and knowledge representation/modeling issues that arise in the storage and use of biological signals and images. System descriptions are welcome if they illustrate and substantiate the underlying methodology that is the principal focus of the report and an effort is made to address the generalizability and/or range of application of that methodology. Note also that, given the international nature of JBI, papers that deal with specific languages other than English, or with country-specific health systems or approaches, are acceptable for JBI only if they offer generalizable lessons that are relevant to the broad JBI readership, regardless of their country, language, culture, or health system.