Five advanced chatbots solving European Diploma in Radiology (EDiR) text-based questions: differences in performance and consistency.

IF 3.6, Q1 (Radiology, Nuclear Medicine & Medical Imaging)
Jakub Pristoupil, Laura Oleaga, Vanesa Junquero, Cristina Merino, Suha Sureyya Ozbek, Lukas Lambert
{"title":"Five advanced chatbots solving European Diploma in Radiology (EDiR) text-based questions: differences in performance and consistency.","authors":"Jakub Pristoupil, Laura Oleaga, Vanesa Junquero, Cristina Merino, Suha Sureyya Ozbek, Lukas Lambert","doi":"10.1186/s41747-025-00591-0","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>We compared the performance, confidence, and response consistency of five chatbots powered by large language models in solving European Diploma in Radiology (EDiR) text-based multiple-response questions.</p><p><strong>Methods: </strong>ChatGPT-4o, ChatGPT-4o-mini, Copilot, Gemini, and Claude 3.5 Sonnet were tested using 52 text-based multiple-response questions from two previous EDiR sessions in two iterations. Chatbots were prompted to evaluate each answer as correct or incorrect and grade its confidence level on a scale of 0 (not confident at all) to 10 (most confident). Scores per question were calculated using a weighted formula that accounted for correct and incorrect answers (range 0.0-1.0).</p><p><strong>Results: </strong>Claude 3.5 Sonnet achieved the highest score per question (0.84 ± 0.26, mean ± standard deviation) compared to ChatGPT-4o (0.76 ± 0.31), ChatGPT-4o-mini (0.64 ± 0.35), Copilot (0.62 ± 0.37), and Gemini (0.54 ± 0.39) (p < 0.001). A self-reported confidence in answering the questions was 9.0 ± 0.9 for Claude 3.5 Sonnet followed by ChatGPT-4o (8.7 ± 1.1), compared to ChatGPT-4o-mini (8.2 ± 1.3), Copilot (8.2 ± 2.2), and Gemini (8.2 ± 1.6, p < 0.001). Claude 3.5 Sonnet demonstrated superior consistency, changing responses in 5.4% of cases between the two iterations, compared to ChatGPT-4o (6.5%), ChatGPT-4o-mini (8.8%), Copilot (13.8%), and Gemini (18.5%). All chatbots outperformed human candidates from previous EDiR sessions, achieving a passing grade from this part of the examination.</p><p><strong>Conclusion: </strong>Claude 3.5 Sonnet exhibited superior accuracy, confidence, and consistency, with ChatGPT-4o performing nearly as well. The variation in performance among the evaluated models was substantial.</p><p><strong>Relevance statement: </strong>Variation in performance, consistency, and confidence among chatbots in solving EDiR test-based questions highlights the need for cautious deployment, particularly in high-stakes clinical and educational settings.</p><p><strong>Key points: </strong>Claude 3.5 Sonnet outperformed other chatbots in accuracy and response consistency. ChatGPT-4o ranked second, showing strong but slightly less reliable performance. All chatbots surpassed EDiR candidates in text-based EDiR questions.</p>","PeriodicalId":36926,"journal":{"name":"European Radiology Experimental","volume":"9 1","pages":"79"},"PeriodicalIF":3.6000,"publicationDate":"2025-08-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12364795/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"European Radiology Experimental","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1186/s41747-025-00591-0","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"RADIOLOGY, NUCLEAR MEDICINE & MEDICAL IMAGING","Score":null,"Total":0}

Abstract

Background: We compared the performance, confidence, and response consistency of five chatbots powered by large language models in solving European Diploma in Radiology (EDiR) text-based multiple-response questions.

Methods: ChatGPT-4o, ChatGPT-4o-mini, Copilot, Gemini, and Claude 3.5 Sonnet were tested in two iterations on 52 text-based multiple-response questions from two previous EDiR sessions. The chatbots were prompted to evaluate each answer option as correct or incorrect and to grade their confidence on a scale from 0 (not confident at all) to 10 (most confident). Per-question scores were calculated with a weighted formula that accounted for correct and incorrect answers (range 0.0-1.0).
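The abstract does not spell out the weighting, so the sketch below only illustrates one plausible scheme for a multiple-response item: each correctly classified option adds an equal fraction of the point, each misclassified option subtracts it, and the total is clipped to the reported 0.0-1.0 range. The function name and weights are assumptions, not the authors' formula.

```python
from typing import Sequence

def question_score(chatbot_labels: Sequence[bool], key_labels: Sequence[bool]) -> float:
    """Hypothetical per-question score for an EDiR-style multiple-response item.

    Each answer option the chatbot classifies as correct/incorrect is compared
    with the answer key; a correct classification adds an equal fraction of the
    point, an incorrect one subtracts it, and the total is clipped to the
    0.0-1.0 range reported in the study. The exact weighting used by EDiR is
    not given in the abstract, so this scheme is only an assumption.
    """
    n = len(key_labels)
    fraction = 1.0 / n
    score = sum(fraction if given == key else -fraction
                for given, key in zip(chatbot_labels, key_labels))
    return max(0.0, min(1.0, score))

# Example: 5 options, one misclassified -> (4 - 1) / 5 = 0.6 under this scheme
print(question_score([True, False, True, True, False],
                     [True, False, True, False, False]))
```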

Results: Claude 3.5 Sonnet achieved the highest score per question (0.84 ± 0.26, mean ± standard deviation), compared to ChatGPT-4o (0.76 ± 0.31), ChatGPT-4o-mini (0.64 ± 0.35), Copilot (0.62 ± 0.37), and Gemini (0.54 ± 0.39) (p < 0.001). Self-reported confidence in answering the questions was 9.0 ± 0.9 for Claude 3.5 Sonnet, followed by ChatGPT-4o (8.7 ± 1.1), ChatGPT-4o-mini (8.2 ± 1.3), Copilot (8.2 ± 2.2), and Gemini (8.2 ± 1.6) (p < 0.001). Claude 3.5 Sonnet also demonstrated the highest consistency, changing its responses in 5.4% of cases between the two iterations, compared to ChatGPT-4o (6.5%), ChatGPT-4o-mini (8.8%), Copilot (13.8%), and Gemini (18.5%). All chatbots outperformed human candidates from previous EDiR sessions, achieving a passing grade in this part of the examination.
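Consistency here is the share of answer evaluations that changed between the two iterations of the same question set; a minimal sketch of that change-rate calculation follows (the function and example data are illustrative, not the authors' code).

```python
def change_rate(first_run: list[bool], second_run: list[bool]) -> float:
    """Percentage of individual answer evaluations that differ between the
    two iterations of the question set (illustrative helper only)."""
    changed = sum(a != b for a, b in zip(first_run, second_run))
    return 100.0 * changed / len(first_run)

# Example: 2 of 20 evaluations flipped between runs -> 10.0
print(change_rate([True] * 20, [True] * 18 + [False] * 2))
```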

Conclusion: Claude 3.5 Sonnet exhibited superior accuracy, confidence, and consistency, with ChatGPT-4o performing nearly as well. The variation in performance among the evaluated models was substantial.

Relevance statement: Variation in performance, consistency, and confidence among chatbots in solving EDiR test-based questions highlights the need for cautious deployment, particularly in high-stakes clinical and educational settings.

Key points: Claude 3.5 Sonnet outperformed other chatbots in accuracy and response consistency. ChatGPT-4o ranked second, showing strong but slightly less reliable performance. All chatbots surpassed EDiR candidates in text-based EDiR questions.
