Evaluating ChatGPT Performance on the Orthopaedic In-Training Examination.

IF 2.3 Q2 ORTHOPEDICS

JBJS Open Access; Pub Date: 2023-09-08; eCollection Date: 2023-07-01; DOI: 10.2106/JBJS.OA.23.00056

Justin E Kung, Christopher Marshall, Chase Gauthier, Tyler A Gonzalez, J Benjamin Jackson


Citation count: 0

Abstract



Evaluating ChatGPT Performance on the Orthopaedic In-Training Examination.

Background: Artificial intelligence (AI) has the potential to improve medical education and healthcare delivery. ChatGPT is a state-of-the-art natural language processing AI model that has shown impressive capabilities, scoring in the top percentiles on numerous standardized examinations, including the Uniform Bar Exam and Scholastic Aptitude Test. The goal of this study was to evaluate ChatGPT performance on the Orthopaedic In-Training Examination (OITE), an assessment of medical knowledge for orthopaedic residents.

Methods: OITE 2020, 2021, and 2022 questions without images were entered into ChatGPT version 3.5 and version 4 (GPT-4) with zero prompting. Performance was evaluated as the percentage of correct responses and compared with the national average of orthopaedic surgery residents at each postgraduate year (PGY) level. ChatGPT was asked to provide a source for each answer. Each source was categorized as a journal article, book, or website, and whether it could be verified was recorded. The impact factor of each cited journal was also recorded.

Results: ChatGPT answered 196 of 360 questions correctly (54.3%), corresponding to a PGY-1 level. ChatGPT cited a verifiable source for 47.2% of questions, with an average median journal impact factor of 5.4. GPT-4 answered 265 of 360 questions correctly (73.6%), corresponding to the average performance of a PGY-5 and exceeding the corresponding 67% passing score for the American Board of Orthopaedic Surgery Part I Examination. GPT-4 cited a verifiable source for 87.9% of questions, with an average median journal impact factor of 5.2.
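The percentages in the Results follow from the reported counts. The short Python sketch below recomputes them and compares each with the 67% passing score. The function and variable names are illustrative assumptions, not taken from the study. Recomputing 196 of 360 gives about 54.4%, marginally different from the reported 54.3%; the sketch only reproduces the arithmetic and does not try to account for that difference.

```python
# Illustrative recomputation of the reported accuracy figures.
# All names here are hypothetical; only the counts and the 67%
# threshold come from the abstract.

ABOS_PART_I_PASSING_PCT = 67.0  # passing score cited in the abstract


def percent_correct(correct: int, total: int) -> float:
    """Return the share of correct answers as a percentage, to one decimal."""
    return round(100.0 * correct / total, 1)


# (correct, total) counts as reported in the Results
reported_counts = {
    "ChatGPT (version 3.5)": (196, 360),
    "GPT-4": (265, 360),
}

for model, (correct, total) in reported_counts.items():
    pct = percent_correct(correct, total)
    relation = "above" if pct > ABOS_PART_I_PASSING_PCT else "below"
    print(f"{model}: {correct}/{total} = {pct}% ({relation} the 67% passing score)")
```

Run as written, this places GPT-4 above the threshold and ChatGPT version 3.5 below it, in line with the comparison made in the abstract.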

Conclusions: ChatGPT performed above the average PGY-1 level and GPT-4 performed better than the average PGY-5 level, showing major improvement. Further investigation is needed to determine how successive versions of ChatGPT would perform and how to optimize this technology to improve medical education.

Clinical relevance: AI has the potential to aid in medical education and healthcare delivery.


Source journal