Implementing a Resource-Light and Low-Code Large Language Model System for Information Extraction from Mammography Reports: A Pilot Study.

Fabio Dennstädt, Simon Fauser, Nikola Cihoric, Max Schmerder, Paolo Lombardo, Grazia Maria Cereghetti, Sandro von Däniken, Thomas Minder, Jaro Meyer, Lawrence Chiang, Roberto Gaio, Luc Lerch, Irina Filchenko, Daniel Reichenpfader, Kerstin Denecke, Caslav Vojvodic, Igor Tatalovic, André Sander, Janna Hastings, Daniel M Aebersold, Hendrik von Tengg-Kobligk, Knud Nairz
{"title":"Implementing a Resource-Light and Low-Code Large Language Model System for Information Extraction from Mammography Reports: A Pilot Study.","authors":"Fabio Dennstädt, Simon Fauser, Nikola Cihoric, Max Schmerder, Paolo Lombardo, Grazia Maria Cereghetti, Sandro von Däniken, Thomas Minder, Jaro Meyer, Lawrence Chiang, Roberto Gaio, Luc Lerch, Irina Filchenko, Daniel Reichenpfader, Kerstin Denecke, Caslav Vojvodic, Igor Tatalovic, André Sander, Janna Hastings, Daniel M Aebersold, Hendrik von Tengg-Kobligk, Knud Nairz","doi":"10.1007/s10278-025-01659-4","DOIUrl":null,"url":null,"abstract":"<p><p>Large language models (LLMs) have been successfully used for data extraction from free-text radiology reports. Most current studies were conducted with LLMs accessed via an application programming interface (API). We evaluated the feasibility of using open-source LLMs, deployed on limited local hardware resources for data extraction from free-text mammography reports, using a common data element (CDE)-based structure. Seventy-nine CDEs were defined by an interdisciplinary expert panel, reflecting real-world reporting practice. Sixty-one reports were classified by two independent researchers to establish ground truth. Five different open-source LLMs deployable on a single GPU were used for data extraction using the general-classifier Python package. Extractions were performed for five different prompt approaches with calculation of overall accuracy, micro-recall and micro-F1. Additional analyses were conducted using thresholds for the relative probability of classifications. High inter-rater agreement was observed between manual classifiers (Cohen's kappa 0.83). Using default prompts, the LLMs achieved accuracies of 59.2-72.9%. Chain-of-thought prompting yielded mixed results, while few-shot prompting led to decreased accuracy. Adaptation of the default prompts to precisely define classification tasks improved performance for all models, with accuracies of 64.7-85.3%. Setting certainty thresholds further improved accuracies to > 90% but reduced the coverage rate to < 50%. Locally deployed open-source LLMs can effectively extract information from mammography reports, maintaining compatibility with limited computational resources. Selection and evaluation of the model and prompting strategy are critical. Clear, task-specific instructions appear crucial for high performance. Using a CDE-based framework provides clear semantics and structure for the data extraction.</p>","PeriodicalId":516858,"journal":{"name":"Journal of imaging informatics in medicine","volume":" ","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2025-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of imaging informatics in medicine","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1007/s10278-025-01659-4","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Large language models (LLMs) have been successfully used for data extraction from free-text radiology reports. Most studies to date were conducted with LLMs accessed via an application programming interface (API). We evaluated the feasibility of using open-source LLMs deployed on limited local hardware resources for data extraction from free-text mammography reports, using a common data element (CDE)-based structure. Seventy-nine CDEs were defined by an interdisciplinary expert panel, reflecting real-world reporting practice. Sixty-one reports were classified by two independent researchers to establish ground truth. Five different open-source LLMs deployable on a single GPU were used for data extraction with the general-classifier Python package. Extractions were performed with five different prompting approaches, and overall accuracy, micro-recall, and micro-F1 were calculated. Additional analyses applied thresholds to the relative probability of classifications. High inter-rater agreement was observed between the manual classifiers (Cohen's kappa 0.83). Using default prompts, the LLMs achieved accuracies of 59.2-72.9%. Chain-of-thought prompting yielded mixed results, while few-shot prompting decreased accuracy. Adapting the default prompts to precisely define the classification tasks improved performance for all models, with accuracies of 64.7-85.3%. Setting certainty thresholds further improved accuracies to >90% but reduced the coverage rate to <50%. Locally deployed open-source LLMs can effectively extract information from mammography reports while remaining compatible with limited computational resources. Selection and evaluation of the model and prompting strategy are critical, and clear, task-specific instructions appear crucial for high performance. A CDE-based framework provides clear semantics and structure for the data extraction.
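The abstract describes the core technique at a high level: each CDE is cast as a closed-set classification task over a free-text report, the relative probability of each candidate value is derived from the model's token likelihoods, and a certainty threshold lets the system abstain when no label is clearly favored. The sketch below illustrates that idea using the Hugging Face transformers API rather than the general-classifier package used in the study (whose interface is not reproduced here); the model name, the example CDE, the candidate labels, the prompt wording, and the 0.8 threshold are illustrative assumptions, not values taken from the paper.

```python
# Minimal sketch (not the study's implementation): classify one CDE from a
# mammography report with a locally hosted open-source causal LLM by scoring
# each candidate label's likelihood and applying a certainty threshold.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "mistralai/Mistral-7B-Instruct-v0.2"  # placeholder: any single-GPU open-source LLM
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.float16, device_map="auto"  # device_map requires the accelerate package
)

def classify_cde(report: str, cde: str, labels: list[str], threshold: float = 0.8):
    """Score each candidate label by the log-likelihood the model assigns to it
    after a task-specific prompt; return None if the best label's relative
    probability falls below the certainty threshold (abstain)."""
    prompt = (
        f"Report:\n{report}\n\n"
        f"Question: What is the value of '{cde}' in this report? "
        f"Answer with exactly one of: {', '.join(labels)}.\nAnswer:"
    )
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
    scores = []
    for label in labels:
        label_ids = tokenizer(
            " " + label, add_special_tokens=False, return_tensors="pt"
        ).input_ids.to(model.device)
        input_ids = torch.cat([prompt_ids, label_ids], dim=-1)
        with torch.no_grad():
            logits = model(input_ids).logits
        # Log-probabilities at the positions that predict the label tokens
        log_probs = torch.log_softmax(logits[0, prompt_ids.shape[-1] - 1 : -1], dim=-1)
        scores.append(log_probs.gather(1, label_ids[0].unsqueeze(-1)).sum())
    # Relative probability of each label, normalized over the candidate set
    probs = torch.softmax(torch.stack(scores), dim=0)
    best = int(torch.argmax(probs))
    if probs[best] >= threshold:
        return labels[best], float(probs[best])
    return None, float(probs[best])  # abstain: trades coverage for accuracy
```

Raising the threshold makes the classifier abstain more often, which matches the trade-off reported above: accuracies above 90% at the cost of a coverage rate below 50%.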
