Evaluation of large language models in generating pulmonary nodule follow-up recommendations

Impact factor: 2.9; JCR Q3 (Radiology, Nuclear Medicine & Medical Imaging)

European Journal of Radiology Open. Pub Date: 2025-04-30. DOI: 10.1016/j.ejro.2025.100655

Junzhe Wen, Wanyue Huang, Huzheng Yan, Jie Sun, Mengshi Dong, Chao Li, Jie Qin

Volume 14, Article 100655. Full text: https://www.sciencedirect.com/science/article/pii/S235204772500022X

Citations: 0

Abstract

Rationale and objectives

To evaluate the performance of large language models (LLMs) in generating clinical follow-up recommendations for pulmonary nodules from radiological report findings and management guidelines.

Materials and methods

This retrospective study included CT follow-up reports of pulmonary nodules documented by senior radiologists from September 1, 2023, to April 30, 2024. An additional sixty reports were collected for prompt engineering, which was based on few-shot learning and the chain-of-thought methodology. The radiological findings for each pulmonary nodule, together with the final prompt, were input into GPT-4o-mini or ERNIE-4.0-Turbo-8K to generate follow-up recommendations. The AI-generated recommendations were evaluated against radiologist-defined, guideline-based standards through binary classification, assessing nodule risk classification, follow-up interval, and harmfulness. Performance metrics included sensitivity, specificity, positive and negative predictive values, and F1 score.
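The performance metrics named above all derive from the four cells of a binary confusion matrix. The following is a minimal sketch of those standard definitions, not the authors' evaluation code; the `ConfusionCounts` class, the `binary_metrics` function, and the example counts are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class ConfusionCounts:
    """Cells of a 2x2 confusion matrix (hypothetical helper, not from the study)."""
    tp: int  # model positive, reference standard positive
    fp: int  # model positive, reference standard negative
    tn: int  # model negative, reference standard negative
    fn: int  # model negative, reference standard positive


def _ratio(numerator: int, denominator: int) -> float:
    """Return numerator/denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def binary_metrics(c: ConfusionCounts) -> dict[str, float]:
    """Compute the standard binary classification metrics from raw counts."""
    sensitivity = _ratio(c.tp, c.tp + c.fn)
    specificity = _ratio(c.tn, c.tn + c.fp)
    ppv = _ratio(c.tp, c.tp + c.fp)
    npv = _ratio(c.tn, c.tn + c.fn)
    accuracy = _ratio(c.tp + c.tn, c.tp + c.fp + c.tn + c.fn)
    # F1 is the harmonic mean of PPV (precision) and sensitivity (recall),
    # which simplifies to 2*TP / (2*TP + FP + FN).
    f1 = _ratio(2 * c.tp, 2 * c.tp + c.fp + c.fn)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
        "f1": f1,
    }


# Illustrative counts only; these are NOT taken from the study.
example = ConfusionCounts(tp=45, fp=5, tn=940, fn=10)
metrics = binary_metrics(example)
for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```

Working from raw counts rather than from percentages keeps every metric tied to the same denominator and makes undefined cases (an empty row or column) explicit as NaN.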

Results

Across 1009 reports from 996 patients (median age, 50.0 years; IQR, 39.0–60.0 years; 511 male patients), ERNIE-4.0-Turbo-8K and GPT-4o-mini performed comparably in both the accuracy of follow-up recommendations (94.6% vs 92.8%, P = 0.07) and harmfulness rates (2.9% vs 3.5%, P = 0.48). In nodule classification, ERNIE-4.0-Turbo-8K and GPT-4o-mini also performed similarly, with accuracy of 99.8% vs 99.9%, sensitivity of 96.9% vs 100.0%, specificity of 99.9% vs 99.9%, positive predictive value of 96.9% vs 96.9%, negative predictive value of 100.0% vs 99.9%, and F1 score of 96.9% vs 98.4%, respectively.
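Because the F1 score is the harmonic mean of positive predictive value and sensitivity, the reported F1 values can be recomputed from the other reported figures as a quick consistency check. The sketch below uses only the rounded percentages quoted in the Results paragraph; the `f1_from` function and the `reported` dictionary are hypothetical names introduced for illustration.

```python
def f1_from(ppv: float, sensitivity: float) -> float:
    """Harmonic mean of positive predictive value and sensitivity."""
    return 2 * ppv * sensitivity / (ppv + sensitivity)


# Rounded values as quoted in the abstract for nodule classification.
reported = {
    "ERNIE-4.0-Turbo-8K": {"sensitivity": 0.969, "ppv": 0.969, "f1": 0.969},
    "GPT-4o-mini": {"sensitivity": 1.000, "ppv": 0.969, "f1": 0.984},
}

# Recompute F1 from the reported PPV and sensitivity, rounded to three decimals.
recomputed = {
    model: round(f1_from(values["ppv"], values["sensitivity"]), 3)
    for model, values in reported.items()
}

for model, values in reported.items():
    print(f"{model}: reported F1 {values['f1']:.1%}, recomputed {recomputed[model]:.1%}")
```

Under these rounded inputs, the recomputed F1 scores agree with the reported 96.9% and 98.4%.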

Conclusion

LLMs show promise in providing guideline-based follow-up recommendations for pulmonary nodules, but require rigorous validation and supervision to mitigate potential clinical risks. This study offers insights into their potential role in automated radiological decision support.
