{"title":"Automated cardiac magnetic resonance interpretation derived from prompted large language models.","authors":"Lujing Wang, Liang Peng, Yixuan Wan, Xingyu Li, Yixin Chen, Li Wang, Xiuxian Gong, Xiaoying Zhao, Lequan Yu, Shihua Zhao, Xinxiang Zhao","doi":"10.21037/cdt-2025-112","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>The versatility of cardiac magnetic resonance (CMR) leads to complex and time-consuming interpretation. Large language models (LLMs) present transformative potential for automated CMR interpretations. We explored the ability of LLMs in the automated classification and diagnosis of CMR reports for three common cardiac diseases: myocardial infarction (MI), dilated cardiomyopathy (DCM), and hypertrophic cardiomyopathy (HCM).</p><p><strong>Methods: </strong>This retrospective study enrolled CMR reports of consecutive patients from January 2015 to July 2024, including reports from three types of cardiac diseases: MI, DCM, and HCM. Six LLMs, including GPT-3.5, GPT-4.0, Gemini-1.0, Gemini-1.5, PaLM, and LLaMA, were used to classify and diagnose the CMR reports. The results of the LLMs, with minimal or informative prompts, were compared with those of radiologists. Accuracy (ACC) and balanced accuracy (BAC) were used to evaluate the classification performance of the different LLMs. The consistency between radiologists and LLMs in classifying heart disease categories was evaluated using Gwet's Agreement Coefficient (AC1 value). Diagnostic performance was analyzed through receiver operating characteristic (ROC) curves. Cohen's kappa was used to assess the reproducibility of the LLMs' diagnostic results obtained at different time intervals (a 30-day interval).</p><p><strong>Results: </strong>This study enrolled 543 CMR cases, including 275 MI, 120 DCM, and 148 HCM cases. The overall BAC of the minimal prompted LLMs, from highest to lowest, were GPT-4.0, LLaMA, PaLM, GPT-3.5, Gemini-1.5, and Gemini-1.0. 
The informative prompted models of GPT-3.5 (P<0.001), GPT-4.0 (P<0.001), Gemini-1.0 (P<0.001), Gemini-1.5 (P=0.02), and PaLM (P<0.001) showed significant improvements in overall ACC compared to their minimal prompted models, whereas the informative prompted model of LLaMA did not show a significant improvement in overall ACC compared to the minimal prompted model (P=0.06). GPT-4.0 performed best in both the minimal prompted (ACC =88.6%, BAC =91.7%) and informative prompted (ACC =95.8%, BAC =97.1%) models. GPT-4.0 demonstrated the highest agreement with radiologists [AC1=0.82, 95% confidence interval (CI): 0.78-0.86], significantly outperforming others (P<0.001). For the informative prompted models of LLMs, GPT-4.0 + informative prompt (AC1=0.93, 95% CI: 0.90-0.96), GPT-3.5 + informative prompt (AC1=0.93, 95% CI: 0.90-0.95), Gemini-1.0 + informative prompt (AC1=0.90, 95% CI: 0.87-0.93), PaLM + informative prompt (AC1=0.86, 95% CI: 0.82-0.90), LLaMA + informative prompt (AC1=0.82, 95% CI: 0.78-0.86), and Gemini-1.5 + informative prompt (AC1=0.80, 95% CI: 0.76-0.84) all showed almost perfect agreement with radiologists' diagnoses. Diagnostic performance was excellent for GPT-4.0 [area under the curve (AUC)=0.93, 95% CI: 0.92-0.95] and LLaMA (AUC =0.92, 95% CI: 0.90-0.94) in minimal prompted models, while informative prompted models achieved superior performance, with GPT-4.0 + informative prompt reaching the highest AUC of 0.98 (95% CI: 0.97-0.99). 
All models demonstrated good reproducibility (κ>0.80, P<0.001).</p><p><strong>Conclusions: </strong>LLMs demonstrated outstanding performance in the automated classification and diagnosis of targeted CMR interpretations, especially with informative prompts, suggesting the potential for these models to serve as adjunct tools in CMR diagnostic workflows.</p>","PeriodicalId":9592,"journal":{"name":"Cardiovascular diagnosis and therapy","volume":"15 4","pages":"726-737"},"PeriodicalIF":2.1000,"publicationDate":"2025-08-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12432601/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Cardiovascular diagnosis and therapy","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.21037/cdt-2025-112","RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2025/8/28 0:00:00","PubModel":"Epub","JCR":"Q3","JCRName":"CARDIAC & CARDIOVASCULAR SYSTEMS","Score":null,"Total":0}
Cited by: 0
Abstract
Background: The versatility of cardiac magnetic resonance (CMR) makes its interpretation complex and time-consuming. Large language models (LLMs) offer transformative potential for automated CMR interpretation. We explored the ability of LLMs to automatically classify and diagnose CMR reports for three common cardiac diseases: myocardial infarction (MI), dilated cardiomyopathy (DCM), and hypertrophic cardiomyopathy (HCM).
Methods: This retrospective study enrolled CMR reports of consecutive patients from January 2015 to July 2024, covering three cardiac diseases: MI, DCM, and HCM. Six LLMs (GPT-3.5, GPT-4.0, Gemini-1.0, Gemini-1.5, PaLM, and LLaMA) were used to classify and diagnose the reports. The outputs of the LLMs, given either minimal or informative prompts, were compared with radiologists' interpretations. Accuracy (ACC) and balanced accuracy (BAC) were used to evaluate the classification performance of the different LLMs. The consistency between radiologists and LLMs in classifying heart disease categories was evaluated using Gwet's agreement coefficient (AC1). Diagnostic performance was analyzed through receiver operating characteristic (ROC) curves. Cohen's kappa was used to assess the reproducibility of each LLM's diagnostic results obtained 30 days apart.
Results: This study enrolled 543 CMR cases: 275 MI, 120 DCM, and 148 HCM. Ranked by overall BAC from highest to lowest, the minimally prompted LLMs were GPT-4.0, LLaMA, PaLM, GPT-3.5, Gemini-1.5, and Gemini-1.0. With informative prompts, GPT-3.5 (P<0.001), GPT-4.0 (P<0.001), Gemini-1.0 (P<0.001), Gemini-1.5 (P=0.02), and PaLM (P<0.001) showed significant improvements in overall ACC over their minimally prompted counterparts, whereas LLaMA did not (P=0.06). GPT-4.0 performed best with both minimal (ACC = 88.6%, BAC = 91.7%) and informative (ACC = 95.8%, BAC = 97.1%) prompts. Among minimally prompted models, GPT-4.0 demonstrated the highest agreement with radiologists [AC1 = 0.82, 95% confidence interval (CI): 0.78-0.86], significantly outperforming the others (P<0.001). With informative prompts, all models showed almost perfect agreement with radiologists' diagnoses: GPT-4.0 (AC1 = 0.93, 95% CI: 0.90-0.96), GPT-3.5 (AC1 = 0.93, 95% CI: 0.90-0.95), Gemini-1.0 (AC1 = 0.90, 95% CI: 0.87-0.93), PaLM (AC1 = 0.86, 95% CI: 0.82-0.90), LLaMA (AC1 = 0.82, 95% CI: 0.78-0.86), and Gemini-1.5 (AC1 = 0.80, 95% CI: 0.76-0.84). Diagnostic performance was excellent for the minimally prompted GPT-4.0 [area under the curve (AUC) = 0.93, 95% CI: 0.92-0.95] and LLaMA (AUC = 0.92, 95% CI: 0.90-0.94), while informative prompts yielded superior performance, with GPT-4.0 reaching the highest AUC of 0.98 (95% CI: 0.97-0.99). All models demonstrated good reproducibility (κ>0.80, P<0.001).
Conclusions: LLMs demonstrated outstanding performance in the automated classification and diagnosis of targeted CMR interpretations, especially with informative prompts, suggesting the potential for these models to serve as adjunct tools in CMR diagnostic workflows.
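As a rough illustration of the evaluation metrics used in this study (balanced accuracy, Gwet's AC1, and Cohen's kappa), here is a minimal pure-Python sketch. The label sequences below are hypothetical toy data, not taken from the study; the study itself used 543 real CMR reports and standard statistical software.

```python
from collections import Counter

def balanced_accuracy(y_true, y_pred, labels):
    # Mean of per-class recall, so a minority class such as DCM
    # (120 of 543 cases) weighs equally with MI (275 cases).
    recalls = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        support = sum(1 for t in y_true if t == c)
        recalls.append(tp / support)
    return sum(recalls) / len(recalls)

def gwets_ac1(a, b, labels):
    # Chance-corrected agreement between two raters; more stable
    # than kappa when category prevalences are skewed.
    n, q = len(a), len(labels)
    po = sum(1 for x, y in zip(a, b) if x == y) / n
    ca, cb = Counter(a), Counter(b)
    # Chance agreement uses the average marginal proportion per class.
    pe = sum((ca[c] + cb[c]) / (2 * n) * (1 - (ca[c] + cb[c]) / (2 * n))
             for c in labels) / (q - 1)
    return (po - pe) / (1 - pe)

def cohens_kappa(a, b, labels):
    # Chance-corrected agreement, e.g. between a model's outputs
    # obtained 30 days apart (test-retest reproducibility).
    n = len(a)
    po = sum(1 for x, y in zip(a, b) if x == y) / n
    ca, cb = Counter(a), Counter(b)
    pe = sum(ca[c] * cb[c] for c in labels) / (n * n)
    return (po - pe) / (1 - pe)

# Toy example with the paper's three classes (hypothetical labels).
labels = ["MI", "DCM", "HCM"]
truth = ["MI", "MI", "DCM", "HCM", "HCM", "DCM"]  # radiologist reference
run1  = ["MI", "MI", "DCM", "HCM", "MI",  "DCM"]  # model, first pass
run2  = ["MI", "MI", "DCM", "HCM", "MI",  "DCM"]  # model, 30 days later

print(round(balanced_accuracy(truth, run1, labels), 3))
print(round(gwets_ac1(truth, run1, labels), 3))
print(cohens_kappa(run1, run2, labels))  # identical runs agree perfectly
```

Balanced accuracy rather than plain accuracy matters here because the three disease classes are imbalanced; AC1 rather than kappa for the radiologist-model comparison avoids kappa's paradoxically low values under skewed prevalence.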
Journal overview:
The journal Cardiovascular Diagnosis and Therapy (Print ISSN: 2223-3652; Online ISSN: 2223-3660) accepts basic and clinical science submissions related to cardiovascular medicine and surgery. The mission of the journal is the rapid exchange of scientific information between clinicians and scientists worldwide. To reach this goal, the journal will focus on novel media, using a web-based digital format in addition to the traditional print version. This includes online submission, review, publication, and distribution. The digital format will also allow submission of extensive supporting visual material, both images and video. The website www.thecdt.org will serve as the central hub and also allow posting of comments and online discussion. The journal's website will be linked to a number of international websites (e.g., www.dxy.cn), which will significantly expand the distribution of its contents.