Performance of ChatGPT on the Plastic Surgery In-Training Examination.
Eplasty. 2024;24:e68. Pub Date: 2024-12-18; eCollection Date: 2024-01-01
Brielle E Raine, Katherine A Kozlowski, Cody C Fowler, Jordan D Frey
{"title":"Performance of ChatGPT on the Plastic Surgery In-Training Examination.","authors":"Brielle E Raine, Katherine A Kozlowski, Cody C Fowler, Jordan D Frey","doi":"","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>Recently, the artificial intelligence chatbot Chat Generative Pre-Trained Transformer (ChatGPT) performed well on all United States Medical Licensing Examinations (USMLE), demonstrating a high level of insight into a physician's knowledge base and clinical reasoning ability.<sup>1,2</sup> This study aims to evaluate the performance of ChatGPT on the American Society of Plastic Surgeons (ASPS) Plastic Surgery In-Training Examination (PSITE) to assess its clinical reasoning and decision-making ability and investigate its legitimacy related to plastic surgery competencies.</p><p><strong>Methods: </strong>PSITE questions from 2015 to 2023 were included in this study. Questions with images, charts, and graphs were excluded. ChatGPT 3.5 was prompted to provide the best single letter answer choice. Performance was analyzed across test years, question area of content, taxonomy, and core competency via chi-square analysis. Multivariable logistic regression was performed to identify predictors of ChatGPT performance.</p><p><strong>Results: </strong>In this study, 1850 of 2097 multiple choice questions were included. ChatGPT answered 845 (45.7%) questions correctly, performing the highest on breast/cosmetic topics (49.6%) (<i>P</i> = .070). ChatGPT performed significantly better on questions requiring the lowest level of reasoning (knowledge, 55.1%) compared with more complex questions such as analysis (41.4%) (<i>P</i> = .001). Multivariable analysis identified negative predictors of performance including the hand/lower extremity topic (OR = 0.73, <i>P</i> = .038) and taxonomy levels beyond knowledge (<i>P</i> < .05). Performance on the 2023 exam (53.4%) corresponded to a 4th percentile score when compared with all plastic surgery residents.</p><p><strong>Conclusions: </strong>While ChatGPT's performance has shown promise in other medical domains, our results indicate it may not be a reliable source of information for plastic surgery-related questions or decision-making.</p>","PeriodicalId":93993,"journal":{"name":"Eplasty","volume":"24 ","pages":"e68"},"PeriodicalIF":0.0000,"publicationDate":"2024-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12132409/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Eplasty","FirstCategoryId":"1085","ListUrlMain":"","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2024/1/1 0:00:00","PubModel":"eCollection","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0
Abstract
Background: Recently, the artificial intelligence chatbot Chat Generative Pre-Trained Transformer (ChatGPT) performed well on all steps of the United States Medical Licensing Examination (USMLE), demonstrating a high level of insight into a physician's knowledge base and clinical reasoning ability.1,2 This study aims to evaluate the performance of ChatGPT on the American Society of Plastic Surgeons (ASPS) Plastic Surgery In-Training Examination (PSITE) to assess its clinical reasoning and decision-making ability and to investigate its legitimacy with respect to plastic surgery competencies.
Methods: PSITE questions from 2015 to 2023 were included in this study; questions containing images, charts, or graphs were excluded. ChatGPT 3.5 was prompted to provide the single best letter answer choice for each question. Performance was analyzed across test year, content area, taxonomy level, and core competency via chi-square analysis. Multivariable logistic regression was performed to identify predictors of ChatGPT performance.
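The abstract does not include the prompting code; the following is a minimal sketch of how such single-letter prompting could be scripted, assuming the OpenAI Python client is used. The model identifier, prompt wording, helper function, and example question are illustrative assumptions, not the authors' materials.

# Hypothetical sketch of the single-letter prompting workflow; prompt wording,
# helper name, and model identifier are assumptions for illustration only.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask_single_letter(question_stem: str, choices: dict[str, str]) -> str:
    """Ask the model for the single best answer and return one letter."""
    options = "\n".join(f"{letter}. {text}" for letter, text in choices.items())
    prompt = (
        "Answer the following multiple-choice question with the single best "
        "answer. Respond with only one letter.\n\n"
        f"{question_stem}\n{options}"
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # ChatGPT 3.5, the version evaluated in the study
        messages=[{"role": "user", "content": prompt}],
        temperature=0,          # assumption: deterministic output is preferred
    )
    return response.choices[0].message.content.strip()[:1].upper()

# Usage with a made-up (non-PSITE) question:
# answer = ask_single_letter(
#     "Which flap is most commonly used for autologous breast reconstruction?",
#     {"A": "DIEP flap", "B": "Gracilis flap", "C": "Radial forearm flap"},
# )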
Results: Of 2097 multiple-choice questions, 1850 were included. ChatGPT answered 845 (45.7%) correctly, with its highest accuracy on breast/cosmetic topics (49.6%; P = .070). ChatGPT performed significantly better on questions requiring the lowest level of reasoning (knowledge, 55.1%) than on more complex questions such as analysis (41.4%) (P = .001). Multivariable analysis identified negative predictors of performance, including the hand/lower extremity content area (OR = 0.73, P = .038) and taxonomy levels beyond knowledge (P < .05). Performance on the 2023 examination (53.4%) corresponded to the 4th percentile among all plastic surgery residents.
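For illustration, analyses of this kind can be run on a per-question table with a binary correctness outcome and categorical predictors; the sketch below uses pandas, SciPy, and statsmodels, and the file name and column names are hypothetical rather than the authors' dataset.

# Illustrative analysis sketch; input file and column names are hypothetical.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy.stats import chi2_contingency

df = pd.read_csv("psite_chatgpt_responses.csv")  # one row per question, 'correct' in {0, 1}

# Chi-square test: does accuracy differ across taxonomy levels?
contingency = pd.crosstab(df["taxonomy"], df["correct"])
chi2, p_value, dof, _ = chi2_contingency(contingency)
print(f"chi2 = {chi2:.2f}, p = {p_value:.3f}")

# Multivariable logistic regression: odds of a correct answer by content area,
# taxonomy level, core competency, and test year.
model = smf.logit(
    "correct ~ C(content_area) + C(taxonomy) + C(competency) + C(year)",
    data=df,
).fit()
print(model.summary())
print(np.exp(model.params).round(2))  # exponentiated coefficients = odds ratios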
Conclusions: While ChatGPT has shown promise in other medical domains, our results indicate that it may not be a reliable source of information for plastic surgery-related questions or decision-making.