{"title":"Leveraging open-source large language models for clinical information extraction in resource-constrained settings.","authors":"Luc Builtjes, Joeran Bosma, Mathias Prokop, Bram van Ginneken, Alessa Hering","doi":"10.1093/jamiaopen/ooaf109","DOIUrl":null,"url":null,"abstract":"<p><strong>Objective: </strong>We aimed to evaluate the zero-shot performance of open-source generative large language models (LLMs) on clinical information extraction from Dutch medical reports using the Diagnostic Report Analysis: General Optimization of NLP (DRAGON) benchmark.</p><p><strong>Methods: </strong>We developed and released the llm_extractinator framework, a scalable, open-source tool for automating information extraction from clinical texts using LLMs. We evaluated 9 multilingual open-source LLMs across 28 tasks in the DRAGON benchmark, covering classification, regression, and named entity recognition (NER). All tasks were performed in a zero-shot setting. Model performance was quantified using task-specific metrics and aggregated into a DRAGON utility score. Additionally, we investigated the effect of in-context translation to English.</p><p><strong>Results: </strong>Llama-3.3-70B achieved the highest utility score (0.760), followed by Phi-4-14B (0.751), Qwen-2.5-14B (0.748), and DeepSeek-R1-14B (0.744). These models outperformed or matched a fine-tuned RoBERTa baseline on 17 of 28 tasks, particularly in regression and structured classification. NER performance was consistently low across all models. Translation to English consistently reduced performance.</p><p><strong>Discussion: </strong>Generative LLMs demonstrated strong zero-shot capabilities on clinical natural language processing tasks involving structured inference. Models around 14B parameters performed well overall, with Llama-3.3-70B leading but at high computational cost. Generative models excelled in regression tasks, but were hindered by token-level output formats for NER. Translation to English reduced performance, emphasizing the need for native language support.</p><p><strong>Conclusion: </strong>Open-source generative LLMs provide a viable zero-shot alternative for clinical information extraction from Dutch medical texts, particularly in low-resource and multilingual settings.</p>","PeriodicalId":36278,"journal":{"name":"JAMIA Open","volume":"8 5","pages":"ooaf109"},"PeriodicalIF":3.4000,"publicationDate":"2025-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12488231/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"JAMIA Open","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1093/jamiaopen/ooaf109","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"HEALTH CARE SCIENCES & SERVICES","Score":null,"Total":0}
引用次数: 0
Abstract
Objective: We aimed to evaluate the zero-shot performance of open-source generative large language models (LLMs) on clinical information extraction from Dutch medical reports using the Diagnostic Report Analysis: General Optimization of NLP (DRAGON) benchmark.
Methods: We developed and released the llm_extractinator framework, a scalable, open-source tool for automating information extraction from clinical texts using LLMs. We evaluated 9 multilingual open-source LLMs across 28 tasks in the DRAGON benchmark, covering classification, regression, and named entity recognition (NER). All tasks were performed in a zero-shot setting. Model performance was quantified using task-specific metrics and aggregated into a DRAGON utility score. Additionally, we investigated the effect of in-context translation to English.
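The paper releases llm_extractinator rather than documenting its API here, so the following is only a minimal sketch of the zero-shot extraction pattern such a pipeline automates: prompt a locally served model to return a JSON object, then parse it. It assumes a model served through Ollama's REST API; the prompt wording, field schema, model tag, and example report are illustrative assumptions, not tasks from the DRAGON benchmark.

```python
# Minimal zero-shot extraction sketch (illustrative; not the llm_extractinator API).
# Assumes a model served locally via Ollama's REST API; the prompt and the
# JSON field schema below are hypothetical, not DRAGON benchmark tasks.
import json
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed local endpoint

def extract(report_text: str, model: str = "llama3.3:70b") -> dict:
    """Ask the model to emit a JSON object with the requested fields."""
    prompt = (
        "You are a clinical information extraction system. "
        "Read the Dutch radiology report below and answer in JSON with keys "
        '"diagnosis" (string) and "lesion_size_mm" (number or null).\n'
        "Report:\n" + report_text
    )
    resp = requests.post(
        OLLAMA_URL,
        json={"model": model, "prompt": prompt, "stream": False, "format": "json"},
        timeout=300,
    )
    resp.raise_for_status()
    # Ollama returns the generated text under the "response" key.
    return json.loads(resp.json()["response"])

if __name__ == "__main__":
    print(extract("Voorbeeldverslag: laesie van 12 mm in de linker bovenkwab."))
```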
Results: Llama-3.3-70B achieved the highest utility score (0.760), followed by Phi-4-14B (0.751), Qwen-2.5-14B (0.748), and DeepSeek-R1-14B (0.744). These models outperformed or matched a fine-tuned RoBERTa baseline on 17 of 28 tasks, particularly in regression and structured classification. NER performance was consistently low across all models. Translation to English consistently reduced performance.
Discussion: Generative LLMs demonstrated strong zero-shot capabilities on clinical natural language processing tasks involving structured inference. Models of around 14B parameters performed well overall, with Llama-3.3-70B leading but at a high computational cost. Generative models excelled in regression tasks but were hindered by token-level output formats for NER. Translation to English reduced performance, emphasizing the need for native-language support.
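The NER limitation noted above stems from a format mismatch: a generative model returns entity strings, while token-level NER metrics score one BIO label per token. The sketch below (not the paper's evaluation code) makes that alignment step concrete, assuming whitespace tokenization and a single hypothetical entity type.

```python
# Sketch of the NER output-format mismatch (illustrative assumption):
# map entity strings returned by a generative model onto per-token BIO labels.
def spans_to_bio(tokens: list[str], entities: list[str]) -> list[str]:
    """Align extracted entity strings to whitespace tokens as BIO labels."""
    labels = ["O"] * len(tokens)
    for entity in entities:
        ent_tokens = entity.split()
        n = len(ent_tokens)
        for i in range(len(tokens) - n + 1):
            if tokens[i : i + n] == ent_tokens:
                labels[i] = "B-ENT"
                for j in range(i + 1, i + n):
                    labels[j] = "I-ENT"
                break  # label the first occurrence only
    return labels

tokens = "laesie van 12 mm in de linker bovenkwab".split()
print(spans_to_bio(tokens, ["12 mm", "linker bovenkwab"]))
# ['O', 'O', 'B-ENT', 'I-ENT', 'O', 'O', 'B-ENT', 'I-ENT']
```

Any deviation between the generated string and the report text (paraphrase, truncation, spelling normalization) leaves tokens unlabeled, which is one plausible reason span-based generation scores poorly under token-level metrics.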
Conclusion: Open-source generative LLMs provide a viable zero-shot alternative for clinical information extraction from Dutch medical texts, particularly in low-resource and multilingual settings.