Comparative performance of artificial intelligence-based large language models on the orthopedic in-training examination.

IF 1.6 · Tier 4 (Medicine)
Andrew Y Xu, Manjot Singh, Mariah Balmaceno-Criss, Allison Oh, David Leigh, Mohammad Daher, Daniel Alsoof, Christopher L McDonald, Bassel G Diebo, Alan H Daniels
Journal: Journal of Orthopaedic Surgery, Vol. 33, No. 1, Article 10225536241268789
DOI: https://doi.org/10.1177/10225536241268789
Published: 2025-01-01 (Journal Article)
Citations: 0

Abstract

Background: Large language models (LLMs) have many clinical applications. However, the comparative performance of different LLMs on orthopedic board-style questions remains largely unknown.

Methods: Three LLMs, OpenAI's GPT-4 and GPT-3.5, and Google Bard, were tested on 189 official 2022 Orthopedic In-Training Examination (OITE) questions. Comparative analyses were conducted to assess their performance against orthopedic resident scores and on higher-order, image-associated, and subject category-specific questions.

Results: GPT-4 surpassed the passing threshold for the 2022 OITE, performing at the level of PGY-3 to PGY-5 (p = .149, p = .502, and p = .818, respectively) and outperforming GPT-3.5 and Bard (p < .001 and p = .001, respectively). While GPT-3.5 and Bard did not meet the passing threshold for the exam, GPT-3.5 performed at the level of PGY-1 to PGY-2 (p = .368 and p = .019, respectively) and Bard performed at the level of PGY-1 to PGY-3 (p = .440, p = .498, and p = .036, respectively). GPT-4 outperformed both Bard and GPT-3.5 on image-associated (p = .003 and p < .001, respectively) and higher-order questions (p < .001). Among the 11 subject categories, all models performed similarly regardless of the subject matter. When individual LLM performance on higher-order questions was assessed, no significant differences were found compared to performance on first-order questions (GPT-4 p = .139, GPT-3.5 p = .124, Bard p = .319). Finally, when individual model performance was assessed on image-associated questions, only GPT-3.5 performed significantly worse compared to performance on non-image-associated questions (p = .045).
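The abstract reports pairwise p-values for differences in model accuracy but does not name the statistical test used. A minimal sketch of one common approach, a two-sided two-proportion z-test on raw correct-answer counts, is shown below; the counts in the example call are hypothetical, not the study's data.

```python
from math import erf, sqrt

def two_proportion_z_test(x1: int, n1: int, x2: int, n2: int):
    """Two-sided z-test for a difference between two proportions.

    x1, x2: number of questions answered correctly by each model
    n1, n2: total questions attempted by each model
    Returns (z statistic, two-sided p-value).
    """
    p1, p2 = x1 / n1, x2 / n2
    # Pooled proportion under the null hypothesis of equal accuracy
    p_pool = (x1 + x2) / (n1 + n2)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value via the standard normal CDF, Phi(x) = 0.5 * (1 + erf(x / sqrt(2)))
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Hypothetical example: two models scoring 150/189 vs 100/189
z, p = two_proportion_z_test(150, 189, 100, 189)
```

Note that the original analysis may instead have used a chi-square test or exact test; this sketch only illustrates the general form of a between-model accuracy comparison on a fixed 189-question exam.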

Conclusion: The AI-based LLM GPT-4 exhibits a robust ability to correctly answer a diverse range of OITE questions, exceeding the minimum passing score for the 2022 OITE and outperforming its predecessor GPT-3.5 and Google Bard.

Source journal
Self-citation rate: 0.00%
Annual publications: 91
About the journal: Journal of Orthopaedic Surgery is an open access peer-reviewed journal publishing original reviews and research articles on all aspects of orthopaedic surgery. It is the official journal of the Asia Pacific Orthopaedic Association. The journal welcomes and will publish materials of a diverse nature, from basic science research to clinical trials and surgical techniques. The journal encourages contributions from all parts of the world, but special emphasis is given to research of particular relevance to the Asia Pacific region.