ChatGPT, Bard, and Bing Chat Are Large Language Processing Models That Answered Orthopaedic In-Training Examination Questions With Similar Accuracy to First-Year Orthopaedic Surgery Residents
Gage A. Guerra B.A., Hayden L. Hofmann B.S., Jonathan L. Le B.S., M.S., Alexander M. Wong B.S., Amir Fathi B.S., Cory K. Mayfield M.D., Frank A. Petrigliano M.D., Joseph N. Liu M.D.
{"title":"ChatGPT, Bard, and Bing Chat Are Large Language Processing Models That Answered Orthopaedic In-Training Examination Questions With Similar Accuracy to First-Year Orthopaedic Surgery Residents","authors":"Gage A. Guerra B.A., Hayden L. Hofmann B.S., Jonathan L. Le B.S., M.S., Alexander M. Wong B.S., Amir Fathi B.S., Cory K. Mayfield M.D., Frank A. Petrigliano M.D., Joseph N. Liu M.D.","doi":"10.1016/j.arthro.2024.08.023","DOIUrl":null,"url":null,"abstract":"<div><h3>Purpose</h3><div>To assess ChatGPT’s, Bard’s, and Bing Chat’s ability to generate accurate orthopaedic diagnoses or corresponding treatments by comparing their performance on the Orthopaedic In-Training Examination (OITE) with that of orthopaedic trainees.</div></div><div><h3>Methods</h3><div>OITE question sets from 2021 and 2022 were compiled to form a large set of 420 questions. ChatGPT (GPT-3.5), Bard, and Bing Chat were instructed to select one of the provided responses to each question. The accuracy of composite questions was recorded and comparatively analyzed to human cohorts including medical students and orthopaedic residents, stratified by postgraduate year (PGY).</div></div><div><h3>Results</h3><div>ChatGPT correctly answered 46.3% of composite questions whereas Bing Chat correctly answered 52.4% of questions and Bard correctly answered 51.4% of questions on the OITE. When image-associated questions were excluded, ChatGPT’s, Bing Chat’s, and Bard’s overall accuracies improved to 49.1%, 53.5%, and 56.8%, respectively. Medical students correctly answered 30.8%, and PGY-1, -2, -3, -4, and -5 orthopaedic residents correctly answered 53.1%, 60.4%, 66.6%, 70.0%, and 71.9%, respectively.</div></div><div><h3>Conclusions</h3><div>ChatGPT, Bard, and Bing Chat are artificial intelligence (AI) models that answered OITE questions with accuracy similar to that of first-year orthopaedic surgery residents. ChatGPT, Bard, and Bing Chat achieved this result without using images or other supplementary media that human test takers are provided.</div></div><div><h3>Clinical Relevance</h3><div>Our comparative performance analysis of AI models on orthopaedic board–style questions highlights ChatGPT’s, Bing Chat’s, and Bard’s clinical knowledge and proficiency. Our analysis establishes a baseline of AI model proficiency in the field of orthopaedics and provides a comparative marker for future, more advanced deep learning models. Although in its elementary phase, future AI models’ orthopaedic knowledge may provide clinical support and serve as an educational tool.</div></div>","PeriodicalId":55459,"journal":{"name":"Arthroscopy-The Journal of Arthroscopic and Related Surgery","volume":"41 3","pages":"Pages 557-562"},"PeriodicalIF":4.4000,"publicationDate":"2024-08-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Arthroscopy-The Journal of Arthroscopic and Related Surgery","FirstCategoryId":"3","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0749806324006212","RegionNum":1,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ORTHOPEDICS","Score":null,"Total":0}
Abstract
Purpose
To assess ChatGPT’s, Bard’s, and Bing Chat’s ability to generate accurate orthopaedic diagnoses or corresponding treatments by comparing their performance on the Orthopaedic In-Training Examination (OITE) with that of orthopaedic trainees.
Methods
OITE question sets from 2021 and 2022 were compiled into a combined set of 420 questions. ChatGPT (GPT-3.5), Bard, and Bing Chat were instructed to select one of the provided responses to each question. Accuracy on the composite question set was recorded and compared with that of human cohorts, including medical students and orthopaedic residents stratified by postgraduate year (PGY).
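The abstract does not specify the interface used to query each model, but the evaluation loop it describes (present each multiple-choice question, instruct the model to pick one of the provided responses, and score the selection) can be sketched as follows. This is a minimal illustration assuming the OpenAI Python client (openai>=1.0) for the GPT-3.5 arm; the helper name, prompt wording, and sample question are hypothetical, not the authors' code.

```python
# Minimal sketch of the question-answering loop described above, assuming
# the OpenAI Python client is queried in place of the chat interfaces named
# in the study. The helper name, prompt wording, and the sample question
# below are hypothetical illustrations, not the authors' code.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask_multiple_choice(stem: str, choices: dict[str, str]) -> str:
    """Instruct the model to select exactly one of the provided responses."""
    options = "\n".join(f"{letter}. {text}" for letter, text in choices.items())
    prompt = (
        "Answer this orthopaedic in-training examination question by "
        "replying with only the single letter of the best answer.\n\n"
        f"{stem}\n\n{options}"
    )
    reply = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # reduce run-to-run variation when scoring
    )
    # Naive parse: take the first character of the reply as the chosen letter.
    return reply.choices[0].message.content.strip()[:1].upper()

# Hypothetical question bank: (stem, choices, correct letter) triples.
questions = [
    ("Which nerve is most at risk from a retractor placed inferior to the "
     "subscapularis during a deltopectoral approach?",
     {"A": "Axillary", "B": "Radial", "C": "Median", "D": "Ulnar"}, "A"),
]

correct = sum(ask_multiple_choice(s, c) == ans for s, c, ans in questions)
print(f"Accuracy: {correct / len(questions):.1%}")
```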
Results
ChatGPT correctly answered 46.3% of composite questions, whereas Bing Chat correctly answered 52.4% and Bard 51.4% of questions on the OITE. When image-associated questions were excluded, the overall accuracies of ChatGPT, Bing Chat, and Bard improved to 49.1%, 53.5%, and 56.8%, respectively. Medical students correctly answered 30.8% of questions, and PGY-1, -2, -3, -4, and -5 orthopaedic residents correctly answered 53.1%, 60.4%, 66.6%, 70.0%, and 71.9%, respectively.
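The abstract reports raw accuracies without naming a statistical test. As a point of reference only, the sketch below shows one conventional way such proportions could be compared, a two-proportion z-test via statsmodels, under the illustrative assumption that the PGY-1 cohort answered the same 420 questions.

```python
# One plausible way to compare a model's accuracy with a resident cohort's:
# a two-proportion z-test via statsmodels. The abstract does not state which
# statistical test the authors used, and giving the PGY-1 cohort the same
# 420-question denominator is an assumption made purely for illustration.
from statsmodels.stats.proportion import proportions_ztest

N = 420                             # composite OITE question count (2021 + 2022)
chatgpt_correct = round(0.463 * N)  # 46.3% reported for ChatGPT
pgy1_correct = round(0.531 * N)     # 53.1% reported for PGY-1 residents

z_stat, p_value = proportions_ztest(
    count=[chatgpt_correct, pgy1_correct],
    nobs=[N, N],
)
print(f"z = {z_stat:.2f}, p = {p_value:.3f}")
```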
Conclusions
ChatGPT, Bard, and Bing Chat are artificial intelligence (AI) models that answered OITE questions with accuracy similar to that of first-year orthopaedic surgery residents. They achieved this result without the images or other supplementary media that are provided to human test takers.
Clinical Relevance
Our comparative performance analysis of AI models on orthopaedic board-style questions highlights the clinical knowledge and proficiency of ChatGPT, Bing Chat, and Bard. It establishes a baseline of AI model proficiency in the field of orthopaedics and provides a comparative marker for future, more advanced deep learning models. Although this technology is in an elementary phase, future AI models' orthopaedic knowledge may provide clinical support and serve as an educational tool.
Journal Introduction:
Nowhere is minimally invasive surgery explained better than in Arthroscopy, the leading peer-reviewed journal in the field. Every issue enables you to put into perspective the usefulness of various emerging arthroscopic techniques. The advantages and disadvantages of these methods, along with their applications in various situations, are discussed in relation to their efficiency, efficacy, and cost benefit. As a special incentive, paid subscribers also receive access to the journal's expanded website.