Large language models for error detection in radiology reports: a comparative analysis between closed-source and privacy-compliant open-source models.

IF 4.7 2区医学 Q1 RADIOLOGY, NUCLEAR MEDICINE & MEDICAL IMAGING

European Radiology Pub Date : 2025-08-01 Epub Date: 2025-02-20 DOI:10.1007/s00330-025-11438-y

Babak Salam, Claire Stüwe, Sebastian Nowak, Alois M Sprinkart, Maike Theis, Dmitrij Kravchenko, Narine Mesropyan, Tatjana Dell, Christoph Endler, Claus C Pieper, Daniel L Kuetting, Julian A Luetkens, Alexander Isaak

{"title":"Large language models for error detection in radiology reports: a comparative analysis between closed-source and privacy-compliant open-source models.","authors":"Babak Salam, Claire Stüwe, Sebastian Nowak, Alois M Sprinkart, Maike Theis, Dmitrij Kravchenko, Narine Mesropyan, Tatjana Dell, Christoph Endler, Claus C Pieper, Daniel L Kuetting, Julian A Luetkens, Alexander Isaak","doi":"10.1007/s00330-025-11438-y","DOIUrl":null,"url":null,"abstract":"Purpose: Large language models (LLMs) like Generative Pre-trained Transformer 4 (GPT-4) can assist in detecting errors in radiology reports, but privacy concerns limit their clinical applicability. This study compares closed-source and privacy-compliant open-source LLMs for detecting common errors in radiology reports.Materials and methods: A total of 120 radiology reports were compiled (30 each from X-ray, ultrasound, CT, and MRI). Subsequently, 397 errors from five categories (typographical, numerical, findings-impression discrepancies, omission/insertion, interpretation) were inserted into 100 of these reports; 20 reports were left unchanged. Two open-source models (Llama 3-70b, Mixtral 8x22b) and two commercial closed-source (GPT-4, GPT-4o) were tasked with error detection using identical prompts. The Kruskall-Wallis test and paired t-test were used for statistical analysis.Results: Open-source LLMs required less processing time per radiology report than closed-source LLMs (6 ± 2 s vs. 13 ± 4 s; p < 0.001). Closed-source LLMs achieved higher error detection rates than open-source LLMs (GPT-4o: 88% [348/397; 95% CI: 86, 92], GPT-4: 83% [328/397; 95% CI: 80, 87], Llama 3-70b: 79% [311/397; 95% CI: 76, 83], Mixtral 8x22b: 73% [288/397; 95% CI: 68, 77]; p < 0.001). Numerical errors (88% [67/76; 95% CI: 82, 93]) were detected significantly more often than typographical errors (75% [65/86; 95% CI: 68, 82]; p = 0.02), discrepancies between findings and impression (73% [73/101; 95% CI: 67, 80]; p < 0.01), and interpretation errors (70% [50/71; 95% CI: 62, 78]; p = 0.001).Conclusion: Open-source LLMs demonstrated effective error detection, albeit with comparatively lower accuracy than commercial closed-source models, and have potential for clinical applications when deployed via privacy-compliant local hosting solutions.Key points: Question Can privacy-compliant open-source large language models (LLMs) match the error-detection performance of commercial non-privacy-compliant closed-source models in radiology reports? Findings Closed-source LLMs achieved slightly higher accuracy in detecting radiology report errors than open-source models, with Llama 3-70b yielding the best results among the open-source models. Clinical relevance Open-source LLMs offer a privacy-compliant alternative for automated error detection in radiology reports, improving clinical workflow efficiency while ensuring patient data confidentiality. Further refinement could enhance their accuracy, contributing to better diagnosis and patient care.","PeriodicalId":12076,"journal":{"name":"European Radiology","volume":" ","pages":"4549-4557"},"PeriodicalIF":4.7000,"publicationDate":"2025-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12226608/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"European Radiology","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1007/s00330-025-11438-y","RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2025/2/20 0:00:00","PubModel":"Epub","JCR":"Q1","JCRName":"RADIOLOGY, NUCLEAR MEDICINE & MEDICAL IMAGING","Score":null,"Total":0}

引用次数: 0

Abstract

Purpose: Large language models (LLMs) like Generative Pre-trained Transformer 4 (GPT-4) can assist in detecting errors in radiology reports, but privacy concerns limit their clinical applicability. This study compares closed-source and privacy-compliant open-source LLMs for detecting common errors in radiology reports.

Materials and methods: A total of 120 radiology reports were compiled (30 each from X-ray, ultrasound, CT, and MRI). Subsequently, 397 errors from five categories (typographical, numerical, findings-impression discrepancies, omission/insertion, interpretation) were inserted into 100 of these reports; 20 reports were left unchanged. Two open-source models (Llama 3-70b, Mixtral 8x22b) and two commercial closed-source (GPT-4, GPT-4o) were tasked with error detection using identical prompts. The Kruskall-Wallis test and paired t-test were used for statistical analysis.

Results: Open-source LLMs required less processing time per radiology report than closed-source LLMs (6 ± 2 s vs. 13 ± 4 s; p < 0.001). Closed-source LLMs achieved higher error detection rates than open-source LLMs (GPT-4o: 88% [348/397; 95% CI: 86, 92], GPT-4: 83% [328/397; 95% CI: 80, 87], Llama 3-70b: 79% [311/397; 95% CI: 76, 83], Mixtral 8x22b: 73% [288/397; 95% CI: 68, 77]; p < 0.001). Numerical errors (88% [67/76; 95% CI: 82, 93]) were detected significantly more often than typographical errors (75% [65/86; 95% CI: 68, 82]; p = 0.02), discrepancies between findings and impression (73% [73/101; 95% CI: 67, 80]; p < 0.01), and interpretation errors (70% [50/71; 95% CI: 62, 78]; p = 0.001).

Conclusion: Open-source LLMs demonstrated effective error detection, albeit with comparatively lower accuracy than commercial closed-source models, and have potential for clinical applications when deployed via privacy-compliant local hosting solutions.

Key points: Question Can privacy-compliant open-source large language models (LLMs) match the error-detection performance of commercial non-privacy-compliant closed-source models in radiology reports? Findings Closed-source LLMs achieved slightly higher accuracy in detecting radiology report errors than open-source models, with Llama 3-70b yielding the best results among the open-source models. Clinical relevance Open-source LLMs offer a privacy-compliant alternative for automated error detection in radiology reports, improving clinical workflow efficiency while ensuring patient data confidentiality. Further refinement could enhance their accuracy, contributing to better diagnosis and patient care.

查看原文本刊更多论文

用于放射学报告错误检测的大型语言模型：闭源模型与符合隐私要求的开源模型之间的比较分析。

目的：大型语言模型（llm），如生成预训练变压器4 （GPT-4）可以帮助检测放射学报告中的错误，但隐私问题限制了它们的临床适用性。本研究比较了闭源和符合隐私的开源llm用于检测放射学报告中的常见错误。材料与方法：共收集120份影像学报告（x线、超声、CT、MRI各30份）。随后，在其中100份报告中插入了5类（印刷、数字、调查结果-印象差异、遗漏/插入、解释）的397个错误；20份报告保持不变。两个开源模型（Llama 3-70b, Mixtral 8x22b）和两个商业闭源模型（GPT-4, gpt - 40）使用相同的提示进行错误检测。采用Kruskall-Wallis检验和配对t检验进行统计分析。结果：与闭源LLMs相比，开源LLMs每份放射学报告所需的处理时间更短（6±2秒vs. 13±4秒）；p结论：开源llm证明了有效的错误检测，尽管其准确性相对低于商业闭源模型，并且在通过符合隐私的本地托管解决方案部署时具有临床应用的潜力。符合隐私的开源大语言模型（llm）在放射学报告中的错误检测性能能否与商业不符合隐私的闭源模型相匹配？发现闭源LLMs检测放射学报告错误的准确率略高于开源模型，其中Llama 3-70b在开源模型中效果最好。开源法学硕士为放射学报告中的自动错误检测提供了符合隐私的替代方案，提高了临床工作流程效率，同时确保了患者数据的保密性。进一步的改进可以提高它们的准确性，有助于更好的诊断和病人护理。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

European Radiology 医学-核医学

CiteScore

11.60

自引率

8.50%

发文量

874

审稿时长

2-4 weeks

期刊介绍： European Radiology (ER) continuously updates scientific knowledge in radiology by publication of strong original articles and state-of-the-art reviews written by leading radiologists. A well balanced combination of review articles, original papers, short communications from European radiological congresses and information on society matters makes ER an indispensable source for current information in this field. This is the Journal of the European Society of Radiology, and the official journal of a number of societies. From 2004-2008 supplements to European Radiology were published under its companion, European Radiology Supplements, ISSN 1613-3749.