Kadir Atakır, Kaan Işın, Abdullah Taş, Hakan Önder
{"title":"chatgpt - 40在放射学中的诊断准确性和一致性:图像、临床数据和答案选项对性能的影响","authors":"Kadir Atakır, Kaan Işın, Abdullah Taş, Hakan Önder","doi":"10.4274/dir.2025.253460","DOIUrl":null,"url":null,"abstract":"<p><strong>Purpose: </strong>This study aimed to evaluate the diagnostic accuracy of Chat Generative Pre-trained Transformer (ChatGPT) version 4 Omni (ChatGPT-4o) in radiology across seven information input combinations (image, clinical data, and multiple-choice options) to assess the consistency of its outputs across repeated trials and to compare its performance with that of human radiologists.</p><p><strong>Methods: </strong>We tested 129 distinct radiology cases under seven input conditions (varying presence of imaging, clinical context, and answer options). Each case was processed by ChatGPT-4o for seven different input combinations on three separate accounts. Diagnostic accuracy was determined by comparison with ground-truth diagnoses, and interobserver consistency was measured using Fleiss' kappa. Pairwise comparisons were performed with the Wilcoxon signed-rank test. Additionally, the same set of cases was evaluated by nine radiology residents to benchmark ChatGPT-4o's performance against human diagnostic accuracy.</p><p><strong>Results: </strong>ChatGPT-4o's diagnostic accuracy was lowest for \"image only\" (19.90%) and \"options only\" (20.67%) conditions. The highest accuracy was observed in \"image + clinical information + options\" (80.88%) and \"clinical information + options\" (75.45%) conditions. The highest interobserver agreement was observed in the \"image + clinical information + options\" condition (κ = 0.733) and the lowest was in the \"options only\" condition (κ = 0.023), suggesting that more information improves consistency. However, there was no effective benefit of adding imaging data over already provided clinical data and options, as seen in post-hoc analysis. In human comparison, ChatGPT-4o outperformed radiology residents in text-based configurations (75.45% vs. 42.89%), whereas residents showed slightly better performance in image-based tasks (64.13% vs. 61.24%). Notably, when residents were allowed to use ChatGPT-4o as a support tool, their image-based diagnostic accuracy increased from 63.04% to 74.16%.</p><p><strong>Conclusion: </strong>ChatGPT-4o performs well when provided with rich textual input but remains limited in purely image- based diagnoses. Its accuracy and consistency increase with multimodal input, yet adding imaging does not significantly improve performance beyond clinical context and diagnostic options alone. The model's superior performance to residents in text-based tasks underscores its potential as a diagnostic aid in structured scenarios. Furthermore, its integration as a support tool may enhance human diagnostic accuracy, particularly in image-based interpretation.</p><p><strong>Clinical significance: </strong>Although ChatGPT-4o is not yet capable of reliably interpreting radiologic images on its own, it demonstrates strong performance in text-based diagnostic reasoning. 
Its integration into clinical workflows-particularly for triage, structured decision support, or educational purposes-may augment radiologists' diagnostic capacity and consistency.</p>","PeriodicalId":11341,"journal":{"name":"Diagnostic and interventional radiology","volume":" ","pages":""},"PeriodicalIF":1.7000,"publicationDate":"2025-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Diagnostic accuracy and consistency of ChatGPT-4o in radiology: influence of image, clinical data, and answer options on performance.\",\"authors\":\"Kadir Atakır, Kaan Işın, Abdullah Taş, Hakan Önder\",\"doi\":\"10.4274/dir.2025.253460\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><strong>Purpose: </strong>This study aimed to evaluate the diagnostic accuracy of Chat Generative Pre-trained Transformer (ChatGPT) version 4 Omni (ChatGPT-4o) in radiology across seven information input combinations (image, clinical data, and multiple-choice options) to assess the consistency of its outputs across repeated trials and to compare its performance with that of human radiologists.</p><p><strong>Methods: </strong>We tested 129 distinct radiology cases under seven input conditions (varying presence of imaging, clinical context, and answer options). Each case was processed by ChatGPT-4o for seven different input combinations on three separate accounts. Diagnostic accuracy was determined by comparison with ground-truth diagnoses, and interobserver consistency was measured using Fleiss' kappa. Pairwise comparisons were performed with the Wilcoxon signed-rank test. Additionally, the same set of cases was evaluated by nine radiology residents to benchmark ChatGPT-4o's performance against human diagnostic accuracy.</p><p><strong>Results: </strong>ChatGPT-4o's diagnostic accuracy was lowest for \\\"image only\\\" (19.90%) and \\\"options only\\\" (20.67%) conditions. The highest accuracy was observed in \\\"image + clinical information + options\\\" (80.88%) and \\\"clinical information + options\\\" (75.45%) conditions. The highest interobserver agreement was observed in the \\\"image + clinical information + options\\\" condition (κ = 0.733) and the lowest was in the \\\"options only\\\" condition (κ = 0.023), suggesting that more information improves consistency. However, there was no effective benefit of adding imaging data over already provided clinical data and options, as seen in post-hoc analysis. In human comparison, ChatGPT-4o outperformed radiology residents in text-based configurations (75.45% vs. 42.89%), whereas residents showed slightly better performance in image-based tasks (64.13% vs. 61.24%). Notably, when residents were allowed to use ChatGPT-4o as a support tool, their image-based diagnostic accuracy increased from 63.04% to 74.16%.</p><p><strong>Conclusion: </strong>ChatGPT-4o performs well when provided with rich textual input but remains limited in purely image- based diagnoses. Its accuracy and consistency increase with multimodal input, yet adding imaging does not significantly improve performance beyond clinical context and diagnostic options alone. The model's superior performance to residents in text-based tasks underscores its potential as a diagnostic aid in structured scenarios. 
Furthermore, its integration as a support tool may enhance human diagnostic accuracy, particularly in image-based interpretation.</p><p><strong>Clinical significance: </strong>Although ChatGPT-4o is not yet capable of reliably interpreting radiologic images on its own, it demonstrates strong performance in text-based diagnostic reasoning. Its integration into clinical workflows-particularly for triage, structured decision support, or educational purposes-may augment radiologists' diagnostic capacity and consistency.</p>\",\"PeriodicalId\":11341,\"journal\":{\"name\":\"Diagnostic and interventional radiology\",\"volume\":\" \",\"pages\":\"\"},\"PeriodicalIF\":1.7000,\"publicationDate\":\"2025-09-22\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Diagnostic and interventional radiology\",\"FirstCategoryId\":\"3\",\"ListUrlMain\":\"https://doi.org/10.4274/dir.2025.253460\",\"RegionNum\":4,\"RegionCategory\":\"医学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"RADIOLOGY, NUCLEAR MEDICINE & MEDICAL IMAGING\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Diagnostic and interventional radiology","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.4274/dir.2025.253460","RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"RADIOLOGY, NUCLEAR MEDICINE & MEDICAL IMAGING","Score":null,"Total":0}
Diagnostic accuracy and consistency of ChatGPT-4o in radiology: influence of image, clinical data, and answer options on performance.
Purpose: This study aimed to evaluate the diagnostic accuracy of Chat Generative Pre-trained Transformer (ChatGPT) version 4 Omni (ChatGPT-4o) in radiology across seven information input combinations (image, clinical data, and multiple-choice options), to assess the consistency of its outputs across repeated trials, and to compare its performance with that of human radiologists.
Methods: We tested 129 distinct radiology cases under seven input conditions (varying the presence of imaging, clinical context, and answer options). Each case was processed by ChatGPT-4o under each of the seven input combinations on three separate accounts. Diagnostic accuracy was determined by comparison with ground-truth diagnoses, and interobserver consistency across the three accounts was measured using Fleiss' kappa. Pairwise comparisons between conditions were performed with the Wilcoxon signed-rank test. Additionally, the same set of cases was evaluated by nine radiology residents to benchmark ChatGPT-4o's performance against human diagnostic accuracy.
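For illustration, the consistency analysis described above could be computed as in the following minimal Python sketch. The data layout and values are hypothetical (the authors' analysis code is not published with the abstract); it treats the three ChatGPT-4o accounts as three raters and applies statsmodels' Fleiss' kappa to a (cases, raters) label matrix.

```python
# Minimal sketch: inter-account consistency via Fleiss' kappa.
# Hypothetical data layout; not the authors' actual pipeline.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

rng = np.random.default_rng(0)
# ratings[i, j] = diagnosis category assigned to case i by account j
# (here: 129 cases, 3 accounts, 5 made-up diagnosis categories).
ratings = rng.integers(0, 5, size=(129, 3))

# Convert the (cases, raters) label matrix into a (cases, categories)
# count table, then compute Fleiss' kappa across the three accounts.
table, _ = aggregate_raters(ratings)
kappa = fleiss_kappa(table, method="fleiss")
print(f"Fleiss' kappa = {kappa:.3f}")
```

The same computation would be repeated once per input condition, yielding the per-condition kappa values reported in the Results.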
Results: ChatGPT-4o's diagnostic accuracy was lowest in the "image only" (19.90%) and "options only" (20.67%) conditions. The highest accuracy was observed in the "image + clinical information + options" (80.88%) and "clinical information + options" (75.45%) conditions. The highest interobserver agreement was observed in the "image + clinical information + options" condition (κ = 0.733) and the lowest in the "options only" condition (κ = 0.023), suggesting that more information improves consistency. However, post-hoc analysis showed no additional benefit from adding imaging data beyond the clinical data and answer options already provided. In the human comparison, ChatGPT-4o outperformed radiology residents in text-based configurations (75.45% vs. 42.89%), whereas residents showed slightly better performance in image-based tasks (64.13% vs. 61.24%). Notably, when residents were allowed to use ChatGPT-4o as a support tool, their image-based diagnostic accuracy increased from 63.04% to 74.16%.
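As a companion sketch (again with hypothetical data), the post-hoc pairwise comparison between two input conditions can be run with SciPy's Wilcoxon signed-rank test on per-case correctness:

```python
# Minimal sketch: pairwise comparison of two input conditions with the
# Wilcoxon signed-rank test on per-case correctness (1 = correct, 0 = wrong).
# Values are hypothetical, not the study data.
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(1)
clin_opts = rng.integers(0, 2, size=129)      # "clinical information + options"
img_clin_opts = rng.integers(0, 2, size=129)  # "image + clinical info + options"

# With binary outcomes the paired differences lie in {-1, 0, 1}, so many
# pairs tie; zero_method="pratt" keeps zero differences in the ranking
# instead of discarding them.
stat, p = wilcoxon(clin_opts, img_clin_opts, zero_method="pratt")
print(f"W = {stat:.1f}, p = {p:.4f}")
```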
Conclusion: ChatGPT-4o performs well when provided with rich textual input but remains limited in purely image-based diagnoses. Its accuracy and consistency increase with multimodal input, yet adding imaging does not significantly improve performance beyond clinical context and diagnostic options alone. The model's superior performance to residents in text-based tasks underscores its potential as a diagnostic aid in structured scenarios. Furthermore, its integration as a support tool may enhance human diagnostic accuracy, particularly in image-based interpretation.
Clinical significance: Although ChatGPT-4o is not yet capable of reliably interpreting radiologic images on its own, it demonstrates strong performance in text-based diagnostic reasoning. Its integration into clinical workflows, particularly for triage, structured decision support, or educational purposes, may augment radiologists' diagnostic capacity and consistency.
Journal introduction:
Diagnostic and Interventional Radiology (Diagn Interv Radiol) is the open access, online-only official publication of Turkish Society of Radiology. It is published bimonthly and the journal’s publication language is English.
The journal is a medium for original articles, reviews, pictorial essays, and technical notes related to all fields of diagnostic and interventional radiology.