Assessing the performance of ChatGPT-4o on the Turkish Orthopedics and Traumatology Board Examination.

IF 1.9 Q2 ORTHOPEDICS
Hilal Yağar, Ender Gümüşoğlu, Zeynel Mert Asfuroğlu
Journal: Joint diseases and related surgery, 36(2), 304-310
DOI: 10.52312/jdrs.2025.1958
Published: 2025-04-05
Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12086493/pdf/

Abstract

Objectives: This study aims to assess the overall performance of ChatGPT version 4-omni (GPT-4o) on the Turkish Orthopedics and Traumatology Board Examination (TOTBE), using actual examinees as the reference point for comparing GPT-4o with human participants.

Materials and methods: GPT-4o was tested with the multiple-choice questions that formed the first step of 14 TOTBEs conducted between 2010 and 2023. Image-based questions were assessed separately for all exams. For the five exams from 2010 to 2014, questions were additionally classified by subspecialty. The performance of GPT-4o was assessed and compared with that of actual TOTBE examinees.

Results: The mean total score of GPT-4o was 70.2±5.64 (range, 61 to 84), whereas that of actual examinees was 58±3.28 (range, 53.6 to 64.6). In terms of accuracy, GPT-4o answered 62% of image-based questions and 70% of text-based questions correctly. GPT-4o outperformed examinees in basic sciences, whereas actual examinees performed better in reconstruction. Both GPT-4o and actual examinees scored lowest in the subspecialty of lower extremity and foot.

Conclusion: Our results show that GPT-4o performed well on the TOTBE, particularly in basic sciences. Its accuracy was comparable to that of actual examinees in some areas, highlighting its potential as a helpful tool in medical education.
