Large language models in radiology reporting - A systematic review of performance, limitations, and clinical implications

Yaara Artsi, Eyal Klang, Jeremy D. Collins, Benjamin S. Glicksberg, Girish N. Nadkarni, Panagiotis Korfiatis, Vera Sorin

Intelligence-based medicine, Volume 12, Article 100287 (2025). DOI: 10.1016/j.ibmed.2025.100287
URL: https://www.sciencedirect.com/science/article/pii/S2666521225000912
Abstract
Rationale and objectives
Large language models (LLMs) and vision-language models (VLMs) have emerged as potential tools for automated radiology reporting. However, concerns regarding their fidelity, reliability, and clinical applicability remain. This systematic review examines the current literature on LLM-generated radiology reports, assessing their fidelity, clinical reliability, and effectiveness. The review aims to identify benefits, limitations, and key factors influencing AI-generated report quality.
Materials and methods
We conducted a systematic search of MEDLINE, Google Scholar, Scopus, and Web of Science to identify studies published between January 2015 and July 2025. Studies evaluating radiology reports generated by Transformer-based generative models (VLMs/LLMs) were included. The review follows PRISMA guidelines. Risk of bias was assessed using the Quality Assessment of Diagnostic Accuracy Studies 2 (QUADAS-2) tool.
Results
Fifteen studies met the inclusion criteria. Four assessed VLMs that generate full radiology reports directly from images, whereas eleven examined LLMs that summarize textual findings into radiology impressions. Six studies evaluated out-of-the-box (base) models, and nine analyzed fine-tuned models. Twelve investigations paired automated natural-language metrics with radiologist review, while three relied on automated metrics alone. Fine-tuned models aligned better with expert evaluations and achieved higher scores on natural language processing metrics than base models. All models exhibited hallucinations, misdiagnoses, and inconsistencies.
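To make "automated natural-language metrics" concrete: a common choice for comparing a model-generated impression against a radiologist-written reference is ROUGE-L, which scores lexical overlap via the longest common subsequence of the two token sequences. The following is a minimal, self-contained Python sketch, not the evaluation code of any reviewed study; the example impressions and the naive whitespace tokenization are illustrative assumptions.

def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, tok_a in enumerate(a):
        for j, tok_b in enumerate(b):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if tok_a == tok_b
                                else max(dp[i][j + 1], dp[i + 1][j]))
    return dp[len(a)][len(b)]

def rouge_l_f1(reference: str, candidate: str) -> float:
    """ROUGE-L F1 between a reference and a model-generated impression.

    Naive whitespace tokenization; real evaluations normalize punctuation
    and often use an established implementation.
    """
    ref, cand = reference.lower().split(), candidate.lower().split()
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

# Hypothetical radiologist-written vs. model-generated impressions.
reference = "No acute cardiopulmonary abnormality"
candidate = "No acute cardiopulmonary process identified"
print(f"ROUGE-L F1: {rouge_l_f1(reference, candidate):.3f}")

Such lexical-overlap scores cannot detect clinically significant errors such as hallucinated findings, which is one reason twelve of the fifteen reviewed studies paired automated metrics with radiologist review rather than relying on automated metrics alone.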
Conclusion
LLMs show promise in radiology reporting. However, limited diagnostic accuracy and persistent hallucinations necessitate human oversight. Future research should focus on improving evaluation frameworks, incorporating diverse datasets, and prospectively validating AI-generated reports in clinical workflows.