Kadir Atakır, Kaan Işın, Abdullah Taş, Hakan Önder
{"title":"chatgpt - 40在放射学中的诊断准确性和一致性:图像、临床数据和答案选项对性能的影响","authors":"Kadir Atakır, Kaan Işın, Abdullah Taş, Hakan Önder","doi":"10.4274/dir.2025.253460","DOIUrl":null,"url":null,"abstract":"<p><strong>Purpose: </strong>This study aimed to evaluate the diagnostic accuracy of Chat Generative Pre-trained Transformer (ChatGPT) version 4 Omni (ChatGPT-4o) in radiology across seven information input combinations (image, clinical data, and multiple-choice options) to assess the consistency of its outputs across repeated trials and to compare its performance with that of human radiologists.</p><p><strong>Methods: </strong>We tested 129 distinct radiology cases under seven input conditions (varying presence of imaging, clinical context, and answer options). Each case was processed by ChatGPT-4o for seven different input combinations on three separate accounts. Diagnostic accuracy was determined by comparison with ground-truth diagnoses, and interobserver consistency was measured using Fleiss' kappa. Pairwise comparisons were performed with the Wilcoxon signed-rank test. Additionally, the same set of cases was evaluated by nine radiology residents to benchmark ChatGPT-4o's performance against human diagnostic accuracy.</p><p><strong>Results: </strong>ChatGPT-4o's diagnostic accuracy was lowest for \"image only\" (19.90%) and \"options only\" (20.67%) conditions. The highest accuracy was observed in \"image + clinical information + options\" (80.88%) and \"clinical information + options\" (75.45%) conditions. The highest interobserver agreement was observed in the \"image + clinical information + options\" condition (κ = 0.733) and the lowest was in the \"options only\" condition (κ = 0.023), suggesting that more information improves consistency. However, there was no effective benefit of adding imaging data over already provided clinical data and options, as seen in post-hoc analysis. In human comparison, ChatGPT-4o outperformed radiology residents in text-based configurations (75.45% vs. 42.89%), whereas residents showed slightly better performance in image-based tasks (64.13% vs. 61.24%). Notably, when residents were allowed to use ChatGPT-4o as a support tool, their image-based diagnostic accuracy increased from 63.04% to 74.16%.</p><p><strong>Conclusion: </strong>ChatGPT-4o performs well when provided with rich textual input but remains limited in purely image- based diagnoses. Its accuracy and consistency increase with multimodal input, yet adding imaging does not significantly improve performance beyond clinical context and diagnostic options alone. The model's superior performance to residents in text-based tasks underscores its potential as a diagnostic aid in structured scenarios. Furthermore, its integration as a support tool may enhance human diagnostic accuracy, particularly in image-based interpretation.</p><p><strong>Clinical significance: </strong>Although ChatGPT-4o is not yet capable of reliably interpreting radiologic images on its own, it demonstrates strong performance in text-based diagnostic reasoning. 
Its integration into clinical workflows-particularly for triage, structured decision support, or educational purposes-may augment radiologists' diagnostic capacity and consistency.</p>","PeriodicalId":11341,"journal":{"name":"Diagnostic and interventional radiology","volume":" ","pages":""},"PeriodicalIF":1.7000,"publicationDate":"2025-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Diagnostic accuracy and consistency of ChatGPT-4o in radiology: influence of image, clinical data, and answer options on performance.\",\"authors\":\"Kadir Atakır, Kaan Işın, Abdullah Taş, Hakan Önder\",\"doi\":\"10.4274/dir.2025.253460\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><strong>Purpose: </strong>This study aimed to evaluate the diagnostic accuracy of Chat Generative Pre-trained Transformer (ChatGPT) version 4 Omni (ChatGPT-4o) in radiology across seven information input combinations (image, clinical data, and multiple-choice options) to assess the consistency of its outputs across repeated trials and to compare its performance with that of human radiologists.</p><p><strong>Methods: </strong>We tested 129 distinct radiology cases under seven input conditions (varying presence of imaging, clinical context, and answer options). Each case was processed by ChatGPT-4o for seven different input combinations on three separate accounts. Diagnostic accuracy was determined by comparison with ground-truth diagnoses, and interobserver consistency was measured using Fleiss' kappa. Pairwise comparisons were performed with the Wilcoxon signed-rank test. Additionally, the same set of cases was evaluated by nine radiology residents to benchmark ChatGPT-4o's performance against human diagnostic accuracy.</p><p><strong>Results: </strong>ChatGPT-4o's diagnostic accuracy was lowest for \\\"image only\\\" (19.90%) and \\\"options only\\\" (20.67%) conditions. The highest accuracy was observed in \\\"image + clinical information + options\\\" (80.88%) and \\\"clinical information + options\\\" (75.45%) conditions. The highest interobserver agreement was observed in the \\\"image + clinical information + options\\\" condition (κ = 0.733) and the lowest was in the \\\"options only\\\" condition (κ = 0.023), suggesting that more information improves consistency. However, there was no effective benefit of adding imaging data over already provided clinical data and options, as seen in post-hoc analysis. In human comparison, ChatGPT-4o outperformed radiology residents in text-based configurations (75.45% vs. 42.89%), whereas residents showed slightly better performance in image-based tasks (64.13% vs. 61.24%). Notably, when residents were allowed to use ChatGPT-4o as a support tool, their image-based diagnostic accuracy increased from 63.04% to 74.16%.</p><p><strong>Conclusion: </strong>ChatGPT-4o performs well when provided with rich textual input but remains limited in purely image- based diagnoses. Its accuracy and consistency increase with multimodal input, yet adding imaging does not significantly improve performance beyond clinical context and diagnostic options alone. The model's superior performance to residents in text-based tasks underscores its potential as a diagnostic aid in structured scenarios. 
Furthermore, its integration as a support tool may enhance human diagnostic accuracy, particularly in image-based interpretation.</p><p><strong>Clinical significance: </strong>Although ChatGPT-4o is not yet capable of reliably interpreting radiologic images on its own, it demonstrates strong performance in text-based diagnostic reasoning. Its integration into clinical workflows-particularly for triage, structured decision support, or educational purposes-may augment radiologists' diagnostic capacity and consistency.</p>\",\"PeriodicalId\":11341,\"journal\":{\"name\":\"Diagnostic and interventional radiology\",\"volume\":\" \",\"pages\":\"\"},\"PeriodicalIF\":1.7000,\"publicationDate\":\"2025-09-22\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Diagnostic and interventional radiology\",\"FirstCategoryId\":\"3\",\"ListUrlMain\":\"https://doi.org/10.4274/dir.2025.253460\",\"RegionNum\":4,\"RegionCategory\":\"医学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"RADIOLOGY, NUCLEAR MEDICINE & MEDICAL IMAGING\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Diagnostic and interventional radiology","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.4274/dir.2025.253460","RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"RADIOLOGY, NUCLEAR MEDICINE & MEDICAL IMAGING","Score":null,"Total":0}
Diagnostic accuracy and consistency of ChatGPT-4o in radiology: influence of image, clinical data, and answer options on performance.
Purpose: This study aimed to evaluate the diagnostic accuracy of Chat Generative Pre-trained Transformer (ChatGPT) version 4 Omni (ChatGPT-4o) in radiology across seven information input combinations (image, clinical data, and multiple-choice options), to assess the consistency of its outputs across repeated trials, and to compare its performance with that of human radiologists.
Methods: We tested 129 distinct radiology cases under seven input conditions (varying the presence of imaging, clinical context, and answer options). Each case was processed by ChatGPT-4o under each of the seven input combinations on three separate accounts. Diagnostic accuracy was determined by comparison with ground-truth diagnoses, and interobserver consistency across the three accounts was measured using Fleiss' kappa. Pairwise comparisons between conditions were performed with the Wilcoxon signed-rank test. Additionally, the same set of cases was evaluated by nine radiology residents to benchmark ChatGPT-4o's performance against human diagnostic accuracy.
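For illustration, the consistency analysis described above could be computed as in the following minimal Python sketch. The data layout and values are hypothetical (the authors' analysis code is not published with the abstract); it treats the three ChatGPT-4o accounts as three raters and applies statsmodels' Fleiss' kappa to a (cases, raters) label matrix.

```python
# Minimal sketch: inter-account consistency via Fleiss' kappa.
# Hypothetical data layout; not the authors' actual pipeline.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

rng = np.random.default_rng(0)
# ratings[i, j] = diagnosis category assigned to case i by account j
# (here: 129 cases, 3 accounts, 5 made-up diagnosis categories).
ratings = rng.integers(0, 5, size=(129, 3))

# Convert the (cases, raters) label matrix into a (cases, categories)
# count table, then compute Fleiss' kappa across the three accounts.
table, _ = aggregate_raters(ratings)
kappa = fleiss_kappa(table, method="fleiss")
print(f"Fleiss' kappa = {kappa:.3f}")
```

The same computation would be repeated once per input condition, yielding the per-condition kappa values reported in the Results.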
Results: ChatGPT-4o's diagnostic accuracy was lowest in the "image only" (19.90%) and "options only" (20.67%) conditions. The highest accuracy was observed in the "image + clinical information + options" (80.88%) and "clinical information + options" (75.45%) conditions. The highest interobserver agreement was observed in the "image + clinical information + options" condition (κ = 0.733) and the lowest in the "options only" condition (κ = 0.023), suggesting that more information improves consistency. However, post-hoc analysis showed no additional benefit from adding imaging data beyond the clinical data and answer options already provided. In the human comparison, ChatGPT-4o outperformed radiology residents in text-based configurations (75.45% vs. 42.89%), whereas residents showed slightly better performance in image-based tasks (64.13% vs. 61.24%). Notably, when residents were allowed to use ChatGPT-4o as a support tool, their image-based diagnostic accuracy increased from 63.04% to 74.16%.
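As a companion sketch (again with hypothetical data), the post-hoc pairwise comparison between two input conditions can be run with SciPy's Wilcoxon signed-rank test on per-case correctness:

```python
# Minimal sketch: pairwise comparison of two input conditions with the
# Wilcoxon signed-rank test on per-case correctness (1 = correct, 0 = wrong).
# Values are hypothetical, not the study data.
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(1)
clin_opts = rng.integers(0, 2, size=129)      # "clinical information + options"
img_clin_opts = rng.integers(0, 2, size=129)  # "image + clinical info + options"

# With binary outcomes the paired differences lie in {-1, 0, 1}, so many
# pairs tie; zero_method="pratt" keeps zero differences in the ranking
# instead of discarding them.
stat, p = wilcoxon(clin_opts, img_clin_opts, zero_method="pratt")
print(f"W = {stat:.1f}, p = {p:.4f}")
```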
Conclusion: ChatGPT-4o performs well when provided with rich textual input but remains limited in purely image-based diagnoses. Its accuracy and consistency increase with multimodal input, yet adding imaging does not significantly improve performance beyond clinical context and diagnostic options alone. The model's superior performance to residents in text-based tasks underscores its potential as a diagnostic aid in structured scenarios. Furthermore, its integration as a support tool may enhance human diagnostic accuracy, particularly in image-based interpretation.
Clinical significance: Although ChatGPT-4o is not yet capable of reliably interpreting radiologic images on its own, it demonstrates strong performance in text-based diagnostic reasoning. Its integration into clinical workflows, particularly for triage, structured decision support, or educational purposes, may augment radiologists' diagnostic capacity and consistency.
Journal introduction:
Diagnostic and Interventional Radiology (Diagn Interv Radiol) is the open access, online-only official publication of Turkish Society of Radiology. It is published bimonthly and the journal’s publication language is English.
The journal is a medium for original articles, reviews, pictorial essays, and technical notes related to all fields of diagnostic and interventional radiology.