Turay Cesur, Yasin Celal Gunes, Eren Camur, Mustafa Dağli
Empowering Radiologists With ChatGPT-4o: Comparative Evaluation of Large Language Models and Radiologists in Cardiac Cases
Journal of Thoracic Imaging (impact factor 1.9; JCR Q3, Radiology, Nuclear Medicine & Medical Imaging). Journal article, published 2025-09-30. DOI: 10.1097/RTI.0000000000000846 (https://doi.org/10.1097/RTI.0000000000000846)
Citations: 0
Abstract
Purpose: This study evaluated the diagnostic accuracy and differential diagnostic capabilities of 12 Large Language Models (LLMs), one cardiac radiologist, and 3 general radiologists in cardiac radiology. The impact of ChatGPT-4o assistance on radiologist performance was also investigated.
Materials and methods: We collected 80 publicly available "Cardiac Case of the Month" cases from the Society of Thoracic Radiology website. LLMs and Radiologist-III were provided with text-based information, whereas the other radiologists visually assessed the cases with and without ChatGPT-4o assistance. Diagnostic accuracy and differential diagnosis scores (DDx scores) were analyzed using the χ2, Kruskal-Wallis, Wilcoxon, McNemar, and Mann-Whitney U tests.
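The McNemar test named above compares paired binary outcomes, such as the same reader's correct/incorrect result on each case without and with assistance. The following is a minimal sketch of the exact (binomial) form of that test. The abstract does not say which variant the authors used, and the discordant counts `b` and `c` below are hypothetical values chosen only for illustration; they are not data from the study.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value.

    b: cases correct without assistance but wrong with it
    c: cases wrong without assistance but correct with it
    Only these discordant pairs carry information about a change.
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, no evidence of a difference
    k = min(b, c)
    # Under the null hypothesis each discordant pair is a fair coin flip,
    # so the smaller count follows Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# HYPOTHETICAL discordant counts (not reported in the abstract).
b = 2
c = 15
p_value = mcnemar_exact(b, c)
print("exact McNemar p-value:", round(p_value, 5))
```

Concordant cases (right both times or wrong both times) do not enter the calculation, which is why the test suits a before/after design on the same 80 cases.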
Results: The unassisted diagnostic accuracy was 72.5% for the cardiac radiologist, 53.8% for general radiologist-I, and 51.3% for general radiologist-II. With ChatGPT-4o, accuracy improved to 78.8%, 70.0%, and 63.8%, respectively. The improvements for general radiologists-I and II were statistically significant (P≤0.006). All radiologists' DDx scores improved significantly with ChatGPT-4o assistance (P≤0.05). Remarkably, general radiologist-I's ChatGPT-4o-assisted diagnostic accuracy and DDx score were not significantly different from the cardiac radiologist's unassisted performance (P>0.05). Among the LLMs, Claude 3 Opus and Claude 3.5 Sonnet had the highest accuracy (81.3%), followed by Claude 3 Sonnet (70.0%). Regarding the DDx score, Claude 3 Opus outperformed all models and radiologist-III (P<0.05). The accuracy of general radiologist-III improved significantly from 48.8% to 63.8% with ChatGPT-4o assistance (P<0.001).
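The reported percentages can be translated back into case counts. The sketch below assumes that every accuracy figure was computed over all 80 cases; the abstract implies this but does not state it. The dictionary labels are illustrative names, not identifiers from the paper.

```python
N_CASES = 80  # number of cases stated in the abstract

# Accuracy percentages as reported in the abstract.
reported = {
    "cardiac radiologist, unassisted": 72.5,
    "cardiac radiologist, assisted": 78.8,
    "general radiologist-I, unassisted": 53.8,
    "general radiologist-I, assisted": 70.0,
    "general radiologist-II, unassisted": 51.3,
    "general radiologist-II, assisted": 63.8,
    "general radiologist-III, unassisted": 48.8,
    "general radiologist-III, assisted": 63.8,
    "Claude 3 Opus / Claude 3.5 Sonnet": 81.3,
    "Claude 3 Sonnet": 70.0,
}


def implied_correct(pct: float, n: int = N_CASES) -> int:
    """Nearest whole number of correct cases for a reported percentage."""
    return round(pct * n / 100)


for label, pct in reported.items():
    k = implied_correct(pct)
    # Consistency check: k/n should match the reported figure to within
    # one-decimal rounding (small slack added for floating point).
    assert abs(100 * k / N_CASES - pct) < 0.051
    print(label, "->", k, "of", N_CASES)
```

Under that assumption, for example, general radiologist-I moves from 43 to 56 correct cases, a net gain of 13.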
Conclusions: ChatGPT-4o may enhance the diagnostic performance of general radiologists in cardiac imaging, suggesting its potential as a diagnostic support tool. Further studies are required to assess its clinical integration.
About the journal:
Journal of Thoracic Imaging (JTI) provides authoritative information on all aspects of the use of imaging techniques in the diagnosis of cardiac and pulmonary diseases. Original articles and analytical reviews published in this timely journal provide the very latest thinking of leading experts concerning the use of chest radiography, computed tomography, magnetic resonance imaging, positron emission tomography, ultrasound, and all other promising imaging techniques in cardiopulmonary radiology.
Official Journal of the Society of Thoracic Radiology:
Japanese Society of Thoracic Radiology
Korean Society of Thoracic Radiology
European Society of Thoracic Imaging.