Can open source large language models be used for tumor documentation in Germany?-An evaluation on urological doctors' notes.

IF 6.1 3区生物学 Q1 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Biodata Mining Pub Date : 2025-07-24 DOI:10.1186/s13040-025-00463-8

Stefan Lenz, Arsenij Ustjanzew, Marco Jeray, Meike Ressing, Torsten Panholzer

{"title":"Can open source large language models be used for tumor documentation in Germany?-An evaluation on urological doctors' notes.","authors":"Stefan Lenz, Arsenij Ustjanzew, Marco Jeray, Meike Ressing, Torsten Panholzer","doi":"10.1186/s13040-025-00463-8","DOIUrl":null,"url":null,"abstract":"Background: Tumor documentation in Germany is currently a largely manual process. It involves reading the textual patient documentation and filling in forms in dedicated databases to obtain structured data. Advances in information extraction techniques that build on large language models (LLMs) could have the potential for enhancing the efficiency and reliability of this process. Evaluating LLMs in the German medical domain, especially their ability to interpret specialized language, is essential to determine their suitability for the use in clinical documentation. Due to data protection regulations, only locally deployed open source LLMs are generally suitable for this application.Methods: The evaluation employs eleven different open source LLMs with sizes ranging from 1 to 70 billion model parameters. Three basic tasks were selected as representative examples for the tumor documentation process: identifying tumor diagnoses, assigning ICD-10 codes, and extracting the date of first diagnosis. For evaluating the LLMs on these tasks, a dataset of annotated text snippets based on anonymized doctors' notes from urology was prepared. Different prompting strategies were used to investigate the effect of the number of examples in few-shot prompting and to explore the capabilities of the LLMs in general.Results: The models Llama 3.1 8B, Mistral 7B, and Mistral NeMo 12 B performed comparably well in the tasks. Models with less extensive training data or having fewer than 7 billion parameters showed notably lower performance, while larger models did not display performance gains. Examples from a different medical domain than urology could also improve the outcome in few-shot prompting, which demonstrates the ability of LLMs to handle tasks needed for tumor documentation.Conclusions: Open source LLMs show a strong potential for automating tumor documentation. Models from 7-12 billion parameters could offer an optimal balance between performance and resource efficiency. With tailored fine-tuning and well-designed prompting, these models might become important tools for clinical documentation in the future. The code for the evaluation is available from https://github.com/stefan-m-lenz/UroLlmEval . We also release the data set under https://huggingface.co/datasets/stefan-m-lenz/UroLlmEvalSet providing a valuable resource that addresses the shortage of authentic and easily accessible benchmarks in German-language medical NLP.","PeriodicalId":48947,"journal":{"name":"Biodata Mining","volume":"18 1","pages":"48"},"PeriodicalIF":6.1000,"publicationDate":"2025-07-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12291363/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Biodata Mining","FirstCategoryId":"99","ListUrlMain":"https://doi.org/10.1186/s13040-025-00463-8","RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"MATHEMATICAL & COMPUTATIONAL BIOLOGY","Score":null,"Total":0}

引用次数: 0

Abstract

Background: Tumor documentation in Germany is currently a largely manual process. It involves reading the textual patient documentation and filling in forms in dedicated databases to obtain structured data. Advances in information extraction techniques that build on large language models (LLMs) could have the potential for enhancing the efficiency and reliability of this process. Evaluating LLMs in the German medical domain, especially their ability to interpret specialized language, is essential to determine their suitability for the use in clinical documentation. Due to data protection regulations, only locally deployed open source LLMs are generally suitable for this application.

Methods: The evaluation employs eleven different open source LLMs with sizes ranging from 1 to 70 billion model parameters. Three basic tasks were selected as representative examples for the tumor documentation process: identifying tumor diagnoses, assigning ICD-10 codes, and extracting the date of first diagnosis. For evaluating the LLMs on these tasks, a dataset of annotated text snippets based on anonymized doctors' notes from urology was prepared. Different prompting strategies were used to investigate the effect of the number of examples in few-shot prompting and to explore the capabilities of the LLMs in general.

Results: The models Llama 3.1 8B, Mistral 7B, and Mistral NeMo 12 B performed comparably well in the tasks. Models with less extensive training data or having fewer than 7 billion parameters showed notably lower performance, while larger models did not display performance gains. Examples from a different medical domain than urology could also improve the outcome in few-shot prompting, which demonstrates the ability of LLMs to handle tasks needed for tumor documentation.

Conclusions: Open source LLMs show a strong potential for automating tumor documentation. Models from 7-12 billion parameters could offer an optimal balance between performance and resource efficiency. With tailored fine-tuning and well-designed prompting, these models might become important tools for clinical documentation in the future. The code for the evaluation is available from https://github.com/stefan-m-lenz/UroLlmEval . We also release the data set under https://huggingface.co/datasets/stefan-m-lenz/UroLlmEvalSet providing a valuable resource that addresses the shortage of authentic and easily accessible benchmarks in German-language medical NLP.

查看原文本刊更多论文

德国的肿瘤文档可以使用开源的大型语言模型吗？——对泌尿科医生病历的评价。

背景：目前德国的肿瘤文献记录主要是手工处理。它包括阅读患者文本文档，并在专用数据库中填写表格，以获得结构化数据。建立在大型语言模型（llm）上的信息提取技术的进步有可能提高这一过程的效率和可靠性。评估德国医学领域的法学硕士，特别是他们解释专业语言的能力，对于确定他们在临床文件中使用的适用性至关重要。由于数据保护规定，通常只有本地部署的开源llm才适合此应用程序。方法：采用11种不同的开源llm进行评估，模型参数从1亿个到700亿个不等。选择三个基本任务作为肿瘤记录过程的代表性示例：识别肿瘤诊断，分配ICD-10代码和提取首次诊断日期。为了评估llm在这些任务上的表现，我们准备了一个基于匿名泌尿科医生笔记的带注释的文本片段数据集。采用不同的提示策略，考察了实例数量对少射提示的影响，并探讨了llm的总体能力。结果：羊驼3.1 8B、西北风7B和西北风NeMo 12b模型在任务中的表现相当好。训练数据较少或参数少于70亿个的模型表现出明显较低的性能，而较大的模型则没有表现出性能提升。与泌尿外科不同的医学领域的例子也可以在少量注射提示中改善结果，这表明llm有能力处理肿瘤记录所需的任务。结论：开源llm显示了自动化肿瘤文档的强大潜力。70 - 120亿个参数的模型可以在性能和资源效率之间提供最佳平衡。通过量身定制的微调和精心设计的提示，这些模型可能成为未来临床记录的重要工具。求值的代码可从https://github.com/stefan-m-lenz/UroLlmEval获得。我们还在https://huggingface.co/datasets/stefan-m-lenz/UroLlmEvalSet下发布了数据集，提供了宝贵的资源，解决了德语医学NLP中真实且易于获取的基准的短缺问题。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

Biodata Mining MATHEMATICAL & COMPUTATIONAL BIOLOGY-

CiteScore

7.90

自引率

0.00%

发文量

审稿时长

23 weeks

期刊介绍： BioData Mining is an open access, open peer-reviewed journal encompassing research on all aspects of data mining applied to high-dimensional biological and biomedical data, focusing on computational aspects of knowledge discovery from large-scale genetic, transcriptomic, genomic, proteomic, and metabolomic data. Topical areas include, but are not limited to: -Development, evaluation, and application of novel data mining and machine learning algorithms. -Adaptation, evaluation, and application of traditional data mining and machine learning algorithms. -Open-source software for the application of data mining and machine learning algorithms. -Design, development and integration of databases, software and web services for the storage, management, retrieval, and analysis of data from large scale studies. -Pre-processing, post-processing, modeling, and interpretation of data mining and machine learning results for biological interpretation and knowledge discovery.