Assessing the performance of ChatGPT-4o on the Turkish Orthopedics and Traumatology Board Examination.

IF 1.9 Q2 ORTHOPEDICS
Hilal Yağar, Ender Gümüşoğlu, Zeynel Mert Asfuroğlu
Journal: Joint diseases and related surgery, 36(2), 304-310
DOI: 10.52312/jdrs.2025.1958
Published: 2025-04-05
Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12086493/pdf/

Abstract

Objectives: This study aims to assess the overall performance of ChatGPT version 4-omni (GPT-4o) on the Turkish Orthopedics and Traumatology Board Examination (TOTBE), using actual examinees as the reference point for comparing GPT-4o with human participants.

Materials and methods: GPT-4o was tested with the multiple-choice questions that formed the first step of 14 TOTBEs conducted between 2010 and 2023. Image-based questions were assessed separately for all exams. For the five exams from 2010 to 2014, questions were additionally classified by subspecialty. The performance of GPT-4o was assessed and compared with that of actual TOTBE examinees.

Results: The mean total score of GPT-4o was 70.2±5.64 (range, 61 to 84), whereas that of actual examinees was 58±3.28 (range, 53.6 to 64.6). In terms of accuracy, GPT-4o answered 62% of image-based questions and 70% of text-based questions correctly. GPT-4o outperformed examinees in basic sciences, whereas actual examinees performed better in reconstruction. Both GPT-4o and actual examinees scored lowest in the subspecialty of lower extremity and foot.

Conclusion: Our results show that GPT-4o performed well on the TOTBE, particularly in basic sciences. Its accuracy was comparable to that of actual examinees in some areas, highlighting its potential as a helpful tool in medical education.
