Comparative performance of artificial intelligence-based large language models on the orthopedic in-training examination

Andrew Y Xu, Manjot Singh, Mariah Balmaceno-Criss, Allison Oh, David Leigh, Mohammad Daher, Daniel Alsoof, Christopher L McDonald, Bassel G Diebo, Alan H Daniels

Journal of Orthopaedic Surgery, 33(1):10225536241268789, 2025. DOI: 10.1177/10225536241268789
Abstract
Background: Large language models (LLMs) have many clinical applications. However, the comparative performance of different LLMs on orthopedic board-style questions remains largely unknown.
Methods: Three LLMs, OpenAI's GPT-4 and GPT-3.5, and Google Bard, were tested on 189 official 2022 Orthopedic In-Training Examination (OITE) questions. Comparative analyses were conducted to assess their performance against orthopedic resident scores and on higher-order, image-associated, and subject category-specific questions.
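The abstract does not include the underlying evaluation pipeline. As a minimal illustrative sketch, a pairwise model comparison of this kind could be scored and tested as below; the answer key, the toy responses, and the choice of a chi-square test on correct/incorrect counts are assumptions made for illustration, not the authors' published code or analysis.

```python
# Hypothetical sketch only: question IDs, answers, and the names
# ANSWER_KEY, model_answers, score, and compare are invented for
# illustration; they do not come from the study.
from scipy.stats import chi2_contingency

# Toy answer key; the real 2022 OITE has 189 scored questions.
ANSWER_KEY = {1: "C", 2: "A", 3: "B", 4: "D"}

# Letter choices collected from each model for the same questions.
model_answers = {
    "GPT-4":   {1: "C", 2: "A", 3: "B", 4: "D"},
    "GPT-3.5": {1: "C", 2: "B", 3: "B", 4: "A"},
    "Bard":    {1: "B", 2: "A", 3: "C", 4: "D"},
}

def score(answers: dict[int, str]) -> tuple[int, int]:
    """Return (correct, incorrect) counts against the answer key."""
    correct = sum(answers.get(q) == key for q, key in ANSWER_KEY.items())
    return correct, len(ANSWER_KEY) - correct

def compare(model_a: str, model_b: str) -> float:
    """Chi-square test of independence on a 2x2 correct/incorrect table."""
    table = [score(model_answers[model_a]), score(model_answers[model_b])]
    _, p_value, _, _ = chi2_contingency(table)
    return p_value

print(f"GPT-4 vs GPT-3.5: p = {compare('GPT-4', 'GPT-3.5'):.3f}")
print(f"GPT-4 vs Bard:    p = {compare('GPT-4', 'Bard'):.3f}")
```

A comparison against resident cohorts could reuse score() in the same way, substituting published per-PGY-level correct/incorrect counts for a second model's answers.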
Results: GPT-4 surpassed the passing threshold for the 2022 OITE, performing at the level of PGY-3 to PGY-5 residents (p = .149, p = .502, and p = .818, respectively) and outperforming GPT-3.5 and Bard (p < .001 and p = .001, respectively). While GPT-3.5 and Bard did not meet the passing threshold for the exam, GPT-3.5 performed at the level of PGY-1 to PGY-2 (p = .368 and p = .019, respectively) and Bard performed at the level of PGY-1 to PGY-3 (p = .440, p = .498, and p = .036, respectively). GPT-4 outperformed both Bard and GPT-3.5 on image-associated questions (p = .003 and p < .001, respectively) and on higher-order questions (p < .001). Across the 11 subject categories, all models performed similarly regardless of subject matter. When each LLM's performance on higher-order questions was compared with its performance on first-order questions, no significant differences were found (GPT-4 p = .139, GPT-3.5 p = .124, Bard p = .319). Finally, when each model's performance on image-associated questions was compared with its performance on non-image-associated questions, only GPT-3.5 performed significantly worse (p = .045).
Conclusion: The AI-based LLM GPT-4 exhibits a robust ability to correctly answer a diverse range of OITE questions, exceeding the minimum passing score for the 2022 OITE and outperforming its predecessor GPT-3.5 as well as Google Bard.
Journal Introduction:
Journal of Orthopaedic Surgery is an open access peer-reviewed journal publishing original reviews and research articles on all aspects of orthopaedic surgery. It is the official journal of the Asia Pacific Orthopaedic Association.
The journal welcomes and will publish materials of a diverse nature, from basic science research to clinical trials and surgical techniques. The journal encourages contributions from all parts of the world, but special emphasis is given to research of particular relevance to the Asia Pacific region.