Title: From text to data: Open-source large language models in extracting cancer related medical attributes from German pathology reports
Authors: Stefan Bartels, Jasmin Carus
DOI: 10.1016/j.ijmedinf.2025.106022
Journal: International Journal of Medical Informatics, Vol. 203, Article 106022
Publication date: 2025-07-02 (Journal Article)
Impact factor: 4.1; JCR Q2 (Computer Science, Information Systems)
URL: https://www.sciencedirect.com/science/article/pii/S1386505625002394
Citations: 0
Abstract
Structured oncological documentation is vital for data-driven cancer care, yet extracting clinical features from unstructured pathology reports remains challenging—especially in German healthcare, where strict data protection rules require local model deployment. This study evaluates open-source large language models (LLMs) for extracting oncological attributes from German pathology reports in a secure, on-premise setting. We created a gold-standard dataset of 522 annotated reports and developed a retrieval-augmented generation (RAG) pipeline using an additional 15,000 pathology reports. Five instruction-tuned LLMs (Llama 3.3 70B, Mistral Small 24B, and three SauerkrautLM variants) were evaluated using three prompting strategies: zero-shot, few-shot, and RAG-enhanced few-shot prompting. All models produced structured JSON outputs and were assessed using entity-level precision, recall, accuracy, and macro-averaged F1-score. Results show that Llama 3.3 70B achieved the highest overall performance (F1 > 0.90). However, when combined with the RAG pipeline, Mistral Small 24B achieved nearly equivalent performance, matching Llama 70B on most entity types while requiring significantly fewer computational resources. Prompting strategy significantly impacted performance: few-shot prompting improved baseline accuracy, and RAG further enhanced performance, particularly for models with fewer than 24B parameters. Challenges remained in extracting less frequent but clinically critical attributes like metastasis and staging, underscoring the importance of retrieval mechanisms and balanced training data. This study demonstrates that open-source LLMs, when paired with effective prompting and retrieval strategies, can enable high-quality, privacy-compliant extraction of oncological information from unstructured text. 
The finding that smaller models can match larger ones through retrieval augmentation highlights a path toward scalable, resource-efficient deployment in German clinical settings.
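The abstract does not give implementation details of the RAG-enhanced few-shot prompting, so the following is only a minimal sketch of how such a prompt could be assembled: retrieve the annotated reports most similar to the target report, inline them as few-shot examples, and ask the model for a JSON answer. All names, the similarity function, and the (shortened, fabricated) German report snippets are hypothetical illustrations, not the authors' pipeline; the study's retrieval index is built from roughly 15,000 real reports.

```python
import json

# Hypothetical pool of annotated example reports (fabricated, heavily shortened).
EXAMPLE_POOL = [
    {"report": "Invasives duktales Karzinom der Mamma, Grading G2, pT2 pN0.",
     "labels": {"diagnosis": "invasives duktales Karzinom", "grading": "G2",
                "t_stage": "pT2", "n_stage": "pN0", "metastasis": None}},
    {"report": "Adenokarzinom des Kolons, Grading G3, pT3 pN1 M1.",
     "labels": {"diagnosis": "Adenokarzinom", "grading": "G3",
                "t_stage": "pT3", "n_stage": "pN1", "metastasis": "M1"}},
]

def token_overlap(a: str, b: str) -> int:
    """Crude lexical similarity: count of shared lowercase tokens.
    (A real pipeline would use embedding similarity instead.)"""
    return len(set(a.lower().split()) & set(b.lower().split()))

def retrieve_examples(report: str, k: int = 1) -> list[dict]:
    """Return the k pool entries most similar to the target report."""
    ranked = sorted(EXAMPLE_POOL,
                    key=lambda e: token_overlap(report, e["report"]),
                    reverse=True)
    return ranked[:k]

def build_prompt(report: str) -> str:
    """Assemble a RAG-enhanced few-shot prompt that requests JSON output."""
    parts = ["Extrahiere die onkologischen Attribute als JSON.", ""]
    for ex in retrieve_examples(report):
        parts.append(f"Befund: {ex['report']}")
        parts.append(f"JSON: {json.dumps(ex['labels'], ensure_ascii=False)}")
        parts.append("")
    parts.append(f"Befund: {report}")
    parts.append("JSON:")
    return "\n".join(parts)

prompt = build_prompt("Adenokarzinom des Rektums, Grading G2, pT3 pN0 M0.")
```

For a rectal adenocarcinoma report, the lexical retriever selects the colon-carcinoma example rather than the breast-carcinoma one, so the few-shot context stays close to the target case; the model's reply would then be parsed with `json.loads`.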
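The study reports entity-level precision, recall, and macro-averaged F1. As the exact matching rules are not given in the abstract, here is a generic sketch under the assumption that each entity is scored by exact value match against the gold annotation, with per-entity F1 scores averaged unweighted (macro); function and field names are illustrative.

```python
def macro_f1(gold: list[dict], pred: list[dict], entities: list[str]) -> float:
    """Macro-averaged F1 over entity types, assuming exact-match scoring.

    A predicted value counts as a true positive only if it equals the gold
    value; a wrong prediction is a false positive (and a false negative if
    gold had a value); a missed gold value is a false negative.
    """
    f1_scores = []
    for ent in entities:
        tp = fp = fn = 0
        for g, p in zip(gold, pred):
            gv, pv = g.get(ent), p.get(ent)
            if pv is not None and pv == gv:
                tp += 1
            elif pv is not None:
                fp += 1
                if gv is not None:
                    fn += 1
            elif gv is not None:
                fn += 1
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1_scores.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1_scores) / len(f1_scores)
```

Macro averaging weights every entity type equally, which is why rare but clinically critical attributes such as metastasis and staging can drag the overall score down even when frequent attributes are extracted well.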
Journal description:
International Journal of Medical Informatics provides an international medium for dissemination of original results and interpretative reviews concerning the field of medical informatics. The Journal emphasizes the evaluation of systems in healthcare settings.
The scope of the journal covers:
Information systems, including national or international registration systems, hospital information systems, departmental and/or physician's office systems, document handling systems, electronic medical record systems, standardization, systems integration etc.;
Computer-aided medical decision support systems using heuristic, algorithmic and/or statistical methods as exemplified in decision theory, protocol development, artificial intelligence, etc.;
Educational computer-based programs pertaining to medical informatics or medicine in general;
Organizational, economic, social, clinical impact, ethical and cost-benefit aspects of IT applications in health care.