Title: Comparing the Diagnostic Performance of GPT-4-based ChatGPT, GPT-4V-based ChatGPT, and Radiologists in Challenging Neuroradiology Cases
Authors: Daisuke Horiuchi, Hiroyuki Tatekawa, Tatsushi Oura, Satoshi Oue, Shannon L Walston, Hirotaka Takita, Shu Matsushita, Yasuhito Mitsuyama, Taro Shimono, Yukio Miki, Daiju Ueda
Journal: Clinical Neuroradiology, pages 779-787
DOI: 10.1007/s00062-024-01426-y
Published: 2024-12-01 (Epub 2024-05-28)
Citations: 0
Abstract
Purpose: To compare the diagnostic performance of Generative Pre-trained Transformer (GPT)-4-based ChatGPT, GPT-4 with vision (GPT-4V)-based ChatGPT, and radiologists in challenging neuroradiology cases.
Methods: We collected 32 consecutive "Freiburg Neuropathology Case Conference" cases published in the journal Clinical Neuroradiology between March 2016 and December 2023. We input the medical history and imaging findings into GPT-4-based ChatGPT, and the medical history and images into GPT-4V-based ChatGPT; each then generated a diagnosis for every case. Six radiologists (three radiology residents and three board-certified radiologists) independently reviewed all cases and provided diagnoses. The diagnostic accuracy rates of ChatGPT and the radiologists were evaluated against the published ground truth. Chi-square tests were performed to compare the diagnostic accuracy of GPT-4-based ChatGPT, GPT-4V-based ChatGPT, and the radiologists.
Results: GPT-4-based and GPT-4V-based ChatGPT achieved accuracy rates of 22% (7/32) and 16% (5/32), respectively. The radiologists achieved the following accuracy rates: the three radiology residents, 28% (9/32), 31% (10/32), and 28% (9/32); the three board-certified radiologists, 38% (12/32), 47% (15/32), and 44% (14/32). GPT-4-based ChatGPT's diagnostic accuracy was lower than that of each radiologist, although not significantly so (all p > 0.07). GPT-4V-based ChatGPT's diagnostic accuracy was likewise lower than that of each radiologist, and significantly lower than that of two board-certified radiologists (p = 0.02 and p = 0.03); the differences from the radiology residents and the remaining board-certified radiologist were not significant (all p > 0.09).
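The chi-square comparisons above can be illustrated with a minimal sketch. The example below computes the Pearson chi-square statistic for a 2x2 contingency table built from two of the reported accuracy counts; the authors' exact procedure (e.g. whether a continuity correction was applied) is not stated in the abstract, so this is an assumption-laden illustration, not a reproduction of the paper's analysis.

```python
# Pearson chi-square statistic for a 2x2 contingency table
# [[a, b], [c, d]], where each row is (correct, incorrect) counts
# for one reader. Illustrative sketch only; the paper's exact
# statistical procedure may differ (e.g. Yates continuity correction).

def chi_square_2x2(a, b, c, d):
    """Return the Pearson chi-square statistic for [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Example counts from the abstract: GPT-4V-based ChatGPT correct on
# 5/32 cases vs. the best-performing board-certified radiologist on 15/32.
stat = chi_square_2x2(5, 27, 15, 17)
print(f"chi-square = {stat:.2f}")  # compare against 3.84 (df=1, alpha=0.05)
```

A statistic above the df = 1 critical value of 3.84 corresponds to p < 0.05, consistent with the abstract's report that GPT-4V-based ChatGPT was significantly less accurate than two of the board-certified radiologists.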
Conclusion: While GPT-4-based ChatGPT demonstrated higher diagnostic performance than GPT-4V-based ChatGPT, neither reached the performance level of the radiology residents or the board-certified radiologists in challenging neuroradiology cases.
Journal description:
Clinical Neuroradiology provides current information, original contributions, and reviews in the field of neuroradiology. An interdisciplinary approach is achieved through diagnostic and therapeutic contributions from related fields.
The international coverage and relevance of the journal are underlined by its status as the official journal of the German, Swiss, and Austrian Societies of Neuroradiology.