Artificial intelligence in orthopaedic education: A comparative analysis of ChatGPT and Bing AI's Orthopaedic In-Training Examination performance

Clark J. Chen, Vivek K. Bilolikar, Duncan VanNest, James Raphael, Gene Shaffer

Medicine Advances | Published 2024-09-15 | DOI: 10.1002/med4.77
https://onlinelibrary.wiley.com/doi/10.1002/med4.77
Abstract
Background
This study evaluated the performance of generative artificial intelligence (AI) models on the Orthopaedic In-Training Examination (OITE), an annual examination administered to residents in U.S. orthopaedic residency programs.
Methods
ChatGPT 3.5 and Bing AI GPT 4.0 were evaluated on standardised sets of multiple-choice questions drawn from the American Academy of Orthopaedic Surgeons OITE online question bank spanning 5 years (2018–2022). A total of 1165 questions were posed to each AI system, and both systems were tested on their latest available versions to standardise the comparison. Historical resident scores taken from the annual OITE technical reports were used for comparison.
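As a rough illustration of the scoring step described here, a minimal Python sketch follows; the question IDs, model responses, and answer key are hypothetical placeholders, not the study's data.

```python
# Minimal sketch: score a model's recorded answer letters against an
# official answer key. All data below is illustrative, not from the study.

def percent_correct(responses: dict[int, str], key: dict[int, str]) -> float:
    """Share of questions (by ID) where the model's letter matches the key."""
    correct = sum(responses[qid] == key[qid] for qid in key)
    return 100 * correct / len(key)

# Toy example:
key = {1: "A", 2: "C", 3: "B"}        # correct choices per question ID
responses = {1: "A", 2: "C", 3: "D"}  # letters the model selected
print(f"{percent_correct(responses, key):.1f}%")  # 66.7%
```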
Results
Across the five datasets, ChatGPT 3.5 scored an average of 55.0% on the OITE questions. Bing AI GPT 4.0 scored higher, with an average of 80.0%. In comparison, the average performance of orthopaedic residents in nationally accredited programs was 62.1%. Bing AI GPT 4.0 outperformed both ChatGPT 3.5 and Accreditation Council for Graduate Medical Education examinees, and analysis of variance demonstrated a significant difference among the groups (p < 0.001). The best single-year performance was achieved by Bing AI GPT 4.0 on the 2020 OITE.
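The group comparison reported here can be sketched as a one-way analysis of variance, for example via scipy.stats.f_oneway; the per-year score vectors below are illustrative placeholders consistent with the reported averages, not the study's actual per-year data.

```python
# Sketch of the group comparison: one-way ANOVA across three groups of
# per-year OITE scores (2018-2022). Values are illustrative only.
from scipy import stats

chatgpt_35 = [54.0, 56.0, 55.0, 53.5, 56.5]  # placeholder per-year scores
bing_gpt4  = [79.0, 81.5, 83.0, 78.0, 78.5]  # placeholder per-year scores
residents  = [61.0, 62.5, 62.0, 63.0, 62.0]  # placeholder per-year scores

f_stat, p_value = stats.f_oneway(chatgpt_35, bing_gpt4, residents)
print(f"F = {f_stat:.2f}, p = {p_value:.4g}")
```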
Conclusion
Generative AI can provide logical context for its answers through in-depth information searches and citation of resources. This capability presents a convincing argument for the potential use of AI in medical education as an interactive learning aid.