Generative pre-trained transformer 4o (GPT-4o) in solving text-based multiple response questions for European Diploma in Radiology (EDiR): a comparative study with radiologists.

IF 4.1 · JCR Q1 (Radiology, Nuclear Medicine & Medical Imaging) · CAS Medicine, Region 2
Jakub Pristoupil, Laura Oleaga, Vanesa Junquero, Cristina Merino, Ozbek Suha Sureyya, Martin Kyncl, Andrea Burgetova, Lukas Lambert
DOI: 10.1186/s13244-025-01941-7
Journal: Insights into Imaging, 16(1):66
Published: 2025-03-22 (Journal Article)
Open access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11929644/pdf/
Citations: 0

Abstract

Objectives: This study aims to assess the accuracy of generative pre-trained transformer 4o (GPT-4o) in answering multiple-response questions from the European Diploma in Radiology (EDiR) examination, comparing its performance to that of human candidates.

Materials and methods: Results from 42 EDiR candidates across Europe were compared to those from 26 fourth-year medical students who answered exclusively using ChatGPT-4o in a prospective study (October 2024). The challenge consisted of 52 recall- or understanding-based EDiR multiple-response questions, all without visual inputs.

Results: GPT-4o achieved a mean score of 82.1 ± 3.0%, significantly outperforming the EDiR candidates at 49.4 ± 10.5% (p < 0.0001). In particular, GPT-4o demonstrated higher true positive rates while maintaining lower false positive rates compared to EDiR candidates, with a higher accuracy rate in all radiology subspecialties (p < 0.0001) except informatics (p = 0.20). There was near-perfect agreement among GPT-4o responses (κ = 0.872) and moderate agreement among EDiR participants (κ = 0.334). Exit surveys revealed that all participants used the copy-and-paste feature, and 73% submitted additional questions to clarify responses.
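The abstract reports agreement as κ without naming the exact statistic; with many raters assigning answers to the same items, Fleiss' kappa is the usual choice. A minimal sketch of that computation (assuming Fleiss' kappa; the counts matrix below is illustrative, not the study's data):

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for multi-rater agreement.

    counts[i][j] = number of raters who assigned item i to category j.
    Every row must sum to the same number of raters n.
    """
    N = len(counts)                  # number of items
    n = sum(counts[0])               # raters per item
    k = len(counts[0])               # number of categories
    # Per-item observed agreement, then its mean
    P_i = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in counts]
    P_bar = sum(P_i) / N
    # Marginal category proportions give the chance-agreement term
    p_j = [sum(row[j] for row in counts) / (N * n) for j in range(k)]
    P_e = sum(p * p for p in p_j)
    return (P_bar - P_e) / (1 - P_e)

# All 3 raters agree on both items -> perfect agreement, kappa = 1.0
print(fleiss_kappa([[3, 0], [0, 3]]))
```

A κ near 0.87, as reported for the GPT-4o responses, indicates near-perfect agreement on conventional interpretation scales, while 0.33 (the EDiR candidates) is in the fair-to-moderate band.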

Conclusions: GPT-4o significantly outperformed human candidates in low-order, text-based EDiR multiple-response questions, demonstrating higher accuracy and reliability. These results highlight GPT-4o's potential in answering text-based radiology questions. Further research is necessary to investigate its performance across different question formats and candidate populations to ensure broader applicability and reliability.

Critical relevance statement: GPT-4o significantly outperforms human candidates in factual radiology text-based questions in the EDiR, excelling especially in identifying correct responses, with a higher accuracy rate compared to radiologists.

Key points: In EDiR text-based questions, ChatGPT-4o scored higher (82%) than EDiR participants (49%). Compared to radiologists, GPT-4o excelled in identifying correct responses. GPT-4o responses demonstrated higher agreement (κ = 0.87) compared to EDiR candidates (κ = 0.33).

Source journal: Insights into Imaging (Medicine – Radiology, Nuclear Medicine and Imaging)
CiteScore: 7.30
Self-citation rate: 4.30%
Articles per year: 182
Review time: 13 weeks
Journal description: Insights into Imaging (I³) is a peer-reviewed open access journal published under the brand SpringerOpen. All content published in the journal is freely available online to anyone, anywhere. I³ continuously updates scientific knowledge and progress in best-practice standards in radiology through the publication of original articles and state-of-the-art reviews and opinions, along with recommendations and statements from the leading radiological societies in Europe. Founded by the European Society of Radiology (ESR), I³ creates a platform for educational material, guidelines and recommendations, and a forum for topics of controversy. A balanced combination of review articles, original papers, short communications from European radiological congresses and information on society matters makes I³ an indispensable source for current information in this field. I³ is owned by the ESR; however, authors retain copyright to their article according to the Creative Commons Attribution License (see Copyright and License Agreement). All articles can be read, redistributed and reused for free, as long as the author of the original work is cited properly. The open access fees (article-processing charges) for this journal are kindly sponsored by ESR for all Members. The journal went open access in 2012, which means that all articles published since then are freely available online.