{"title":"利用开源大型语言模型在资源受限的环境中提取临床信息。","authors":"Luc Builtjes, Joeran Bosma, Mathias Prokop, Bram van Ginneken, Alessa Hering","doi":"10.1093/jamiaopen/ooaf109","DOIUrl":null,"url":null,"abstract":"<p><strong>Objective: </strong>We aimed to evaluate the zero-shot performance of open-source generative large language models (LLMs) on clinical information extraction from Dutch medical reports using the Diagnostic Report Analysis: General Optimization of NLP (DRAGON) benchmark.</p><p><strong>Methods: </strong>We developed and released the llm_extractinator framework, a scalable, open-source tool for automating information extraction from clinical texts using LLMs. We evaluated 9 multilingual open-source LLMs across 28 tasks in the DRAGON benchmark, covering classification, regression, and named entity recognition (NER). All tasks were performed in a zero-shot setting. Model performance was quantified using task-specific metrics and aggregated into a DRAGON utility score. Additionally, we investigated the effect of in-context translation to English.</p><p><strong>Results: </strong>Llama-3.3-70B achieved the highest utility score (0.760), followed by Phi-4-14B (0.751), Qwen-2.5-14B (0.748), and DeepSeek-R1-14B (0.744). These models outperformed or matched a fine-tuned RoBERTa baseline on 17 of 28 tasks, particularly in regression and structured classification. NER performance was consistently low across all models. Translation to English consistently reduced performance.</p><p><strong>Discussion: </strong>Generative LLMs demonstrated strong zero-shot capabilities on clinical natural language processing tasks involving structured inference. Models around 14B parameters performed well overall, with Llama-3.3-70B leading but at high computational cost. Generative models excelled in regression tasks, but were hindered by token-level output formats for NER. Translation to English reduced performance, emphasizing the need for native language support.</p><p><strong>Conclusion: </strong>Open-source generative LLMs provide a viable zero-shot alternative for clinical information extraction from Dutch medical texts, particularly in low-resource and multilingual settings.</p>","PeriodicalId":36278,"journal":{"name":"JAMIA Open","volume":"8 5","pages":"ooaf109"},"PeriodicalIF":3.4000,"publicationDate":"2025-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12488231/pdf/","citationCount":"0","resultStr":"{\"title\":\"Leveraging open-source large language models for clinical information extraction in resource-constrained settings.\",\"authors\":\"Luc Builtjes, Joeran Bosma, Mathias Prokop, Bram van Ginneken, Alessa Hering\",\"doi\":\"10.1093/jamiaopen/ooaf109\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><strong>Objective: </strong>We aimed to evaluate the zero-shot performance of open-source generative large language models (LLMs) on clinical information extraction from Dutch medical reports using the Diagnostic Report Analysis: General Optimization of NLP (DRAGON) benchmark.</p><p><strong>Methods: </strong>We developed and released the llm_extractinator framework, a scalable, open-source tool for automating information extraction from clinical texts using LLMs. We evaluated 9 multilingual open-source LLMs across 28 tasks in the DRAGON benchmark, covering classification, regression, and named entity recognition (NER). All tasks were performed in a zero-shot setting. 
Model performance was quantified using task-specific metrics and aggregated into a DRAGON utility score. Additionally, we investigated the effect of in-context translation to English.</p><p><strong>Results: </strong>Llama-3.3-70B achieved the highest utility score (0.760), followed by Phi-4-14B (0.751), Qwen-2.5-14B (0.748), and DeepSeek-R1-14B (0.744). These models outperformed or matched a fine-tuned RoBERTa baseline on 17 of 28 tasks, particularly in regression and structured classification. NER performance was consistently low across all models. Translation to English consistently reduced performance.</p><p><strong>Discussion: </strong>Generative LLMs demonstrated strong zero-shot capabilities on clinical natural language processing tasks involving structured inference. Models around 14B parameters performed well overall, with Llama-3.3-70B leading but at high computational cost. Generative models excelled in regression tasks, but were hindered by token-level output formats for NER. Translation to English reduced performance, emphasizing the need for native language support.</p><p><strong>Conclusion: </strong>Open-source generative LLMs provide a viable zero-shot alternative for clinical information extraction from Dutch medical texts, particularly in low-resource and multilingual settings.</p>\",\"PeriodicalId\":36278,\"journal\":{\"name\":\"JAMIA Open\",\"volume\":\"8 5\",\"pages\":\"ooaf109\"},\"PeriodicalIF\":3.4000,\"publicationDate\":\"2025-10-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12488231/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"JAMIA Open\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1093/jamiaopen/ooaf109\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"HEALTH CARE SCIENCES & SERVICES\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"JAMIA Open","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1093/jamiaopen/ooaf109","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"HEALTH CARE SCIENCES & SERVICES","Score":null,"Total":0}
Leveraging open-source large language models for clinical information extraction in resource-constrained settings.
Objective: We aimed to evaluate the zero-shot performance of open-source generative large language models (LLMs) on clinical information extraction from Dutch medical reports using the Diagnostic Report Analysis: General Optimization of NLP (DRAGON) benchmark.
Methods: We developed and released the llm_extractinator framework, a scalable, open-source tool for automating information extraction from clinical texts using LLMs. We evaluated 9 multilingual open-source LLMs across 28 tasks in the DRAGON benchmark, covering classification, regression, and named entity recognition (NER). All tasks were performed in a zero-shot setting. Model performance was quantified using task-specific metrics and aggregated into a DRAGON utility score. Additionally, we investigated the effect of in-context translation of the Dutch reports into English.
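As a rough illustration of the kind of zero-shot setup the framework automates, the sketch below pairs a single task-specific instruction with one report and sends it to a locally hosted open-source model. This is not the llm_extractinator API itself; the endpoint, model name, and prompt wording are hypothetical assumptions.

```python
import requests

# Hypothetical sketch of one zero-shot extraction call against a locally
# hosted open-source LLM behind an OpenAI-compatible endpoint (e.g. served
# by vLLM or Ollama). Endpoint URL, model name, and prompt are illustrative;
# this is NOT the actual llm_extractinator implementation.

ENDPOINT = "http://localhost:8000/v1/chat/completions"

def extract(report_text: str, task_instruction: str,
            model: str = "llama-3.3-70b") -> str:
    """Send one report plus a task-specific instruction; return the model's answer."""
    prompt = (
        f"{task_instruction}\n\n"
        f"Report:\n{report_text}\n\n"
        "Answer with only the requested value, no explanation."
    )
    resp = requests.post(
        ENDPOINT,
        json={
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "temperature": 0.0,  # deterministic output for reproducible extraction
        },
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"].strip()

if __name__ == "__main__":
    # Example: a binary classification task over a Dutch radiology report.
    answer = extract(
        report_text="CT thorax: geen aanwijzingen voor longembolie.",
        task_instruction="Classify whether the report describes a pulmonary embolism (yes/no).",
    )
    print(answer)
```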
Results: Llama-3.3-70B achieved the highest utility score (0.760), followed by Phi-4-14B (0.751), Qwen-2.5-14B (0.748), and DeepSeek-R1-14B (0.744). These models outperformed or matched a fine-tuned RoBERTa baseline on 17 of 28 tasks, particularly in regression and structured classification. NER performance was consistently low across all models. Translation to English consistently reduced performance.
Discussion: Generative LLMs demonstrated strong zero-shot capabilities on clinical natural language processing tasks involving structured inference. Models of around 14B parameters performed well overall, with Llama-3.3-70B leading but at a high computational cost. Generative models excelled at regression tasks but were hindered by the token-level output format that NER evaluation requires. Translation to English reduced performance, underscoring the need for native-language support.
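To make the NER limitation concrete, the sketch below shows the brittle step: projecting the entity strings a generative model returns onto token-level BIO labels, where any surface-form mismatch between the generated span and the source text forfeits the whole entity. The function, tag set, and example are illustrative assumptions, not the DRAGON task definition.

```python
# Illustrative sketch (not the DRAGON evaluation code): aligning generated
# entity spans to token-level BIO tags. A paraphrase, typo, or normalization
# in the model's output yields no match, so the entity scores zero.

def spans_to_bio(text: str, entities: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Project (entity_string, label) pairs onto whitespace tokens as BIO tags."""
    tokens = text.split()
    tags = ["O"] * len(tokens)
    for entity, label in entities:
        ent_tokens = entity.split()
        for i in range(len(tokens) - len(ent_tokens) + 1):
            if tokens[i : i + len(ent_tokens)] == ent_tokens:
                tags[i] = f"B-{label}"
                for j in range(i + 1, i + len(ent_tokens)):
                    tags[j] = f"I-{label}"
                break  # first exact match only; inexact output finds nothing
    return list(zip(tokens, tags))

# Hypothetical Dutch snippet with model-extracted spans:
print(spans_to_bio(
    "Laesie in de linker nier , ongeveer 3 cm .",
    [("linker nier", "ANATOMY"), ("3 cm", "SIZE")],
))
```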
Conclusion: Open-source generative LLMs provide a viable zero-shot alternative for clinical information extraction from Dutch medical texts, particularly in low-resource and multilingual settings.