Performance of ChatGPT on the Plastic Surgery In-Training Examination.
Eplasty. 2024;24:e68. Pub Date: 2024-12-18; eCollection Date: 2024-01-01
Brielle E Raine, Katherine A Kozlowski, Cody C Fowler, Jordan D Frey
{"title":"Performance of ChatGPT on the Plastic Surgery In-Training Examination.","authors":"Brielle E Raine, Katherine A Kozlowski, Cody C Fowler, Jordan D Frey","doi":"","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>Recently, the artificial intelligence chatbot Chat Generative Pre-Trained Transformer (ChatGPT) performed well on all United States Medical Licensing Examinations (USMLE), demonstrating a high level of insight into a physician's knowledge base and clinical reasoning ability.<sup>1,2</sup> This study aims to evaluate the performance of ChatGPT on the American Society of Plastic Surgeons (ASPS) Plastic Surgery In-Training Examination (PSITE) to assess its clinical reasoning and decision-making ability and investigate its legitimacy related to plastic surgery competencies.</p><p><strong>Methods: </strong>PSITE questions from 2015 to 2023 were included in this study. Questions with images, charts, and graphs were excluded. ChatGPT 3.5 was prompted to provide the best single letter answer choice. Performance was analyzed across test years, question area of content, taxonomy, and core competency via chi-square analysis. Multivariable logistic regression was performed to identify predictors of ChatGPT performance.</p><p><strong>Results: </strong>In this study, 1850 of 2097 multiple choice questions were included. ChatGPT answered 845 (45.7%) questions correctly, performing the highest on breast/cosmetic topics (49.6%) (<i>P</i> = .070). ChatGPT performed significantly better on questions requiring the lowest level of reasoning (knowledge, 55.1%) compared with more complex questions such as analysis (41.4%) (<i>P</i> = .001). Multivariable analysis identified negative predictors of performance including the hand/lower extremity topic (OR = 0.73, <i>P</i> = .038) and taxonomy levels beyond knowledge (<i>P</i> < .05). Performance on the 2023 exam (53.4%) corresponded to a 4th percentile score when compared with all plastic surgery residents.</p><p><strong>Conclusions: </strong>While ChatGPT's performance has shown promise in other medical domains, our results indicate it may not be a reliable source of information for plastic surgery-related questions or decision-making.</p>","PeriodicalId":93993,"journal":{"name":"Eplasty","volume":"24 ","pages":"e68"},"PeriodicalIF":0.0000,"publicationDate":"2024-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12132409/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Eplasty","FirstCategoryId":"1085","ListUrlMain":"","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2024/1/1 0:00:00","PubModel":"eCollection","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0
Abstract
Background: Recently, the artificial intelligence chatbot Chat Generative Pre-Trained Transformer (ChatGPT) performed well on all steps of the United States Medical Licensing Examination (USMLE), demonstrating a high level of insight into a physician's knowledge base and clinical reasoning ability.1,2 This study aims to evaluate the performance of ChatGPT on the American Society of Plastic Surgeons (ASPS) Plastic Surgery In-Training Examination (PSITE) to assess its clinical reasoning and decision-making ability and to investigate its legitimacy with respect to plastic surgery competencies.
Methods: PSITE questions from 2015 to 2023 were included in this study; questions containing images, charts, or graphs were excluded. ChatGPT 3.5 was prompted to provide the single best letter answer choice for each question. Performance was analyzed across test year, content area, taxonomy level, and core competency via chi-square analysis. Multivariable logistic regression was performed to identify predictors of ChatGPT performance.
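The abstract does not include the prompting code; the following is a minimal sketch of how such single-letter prompting could be scripted, assuming the OpenAI Python client is used. The model identifier, prompt wording, helper function, and example question are illustrative assumptions, not the authors' materials.

# Hypothetical sketch of the single-letter prompting workflow; prompt wording,
# helper name, and model identifier are assumptions for illustration only.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask_single_letter(question_stem: str, choices: dict[str, str]) -> str:
    """Ask the model for the single best answer and return one letter."""
    options = "\n".join(f"{letter}. {text}" for letter, text in choices.items())
    prompt = (
        "Answer the following multiple-choice question with the single best "
        "answer. Respond with only one letter.\n\n"
        f"{question_stem}\n{options}"
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # ChatGPT 3.5, the version evaluated in the study
        messages=[{"role": "user", "content": prompt}],
        temperature=0,          # assumption: deterministic output is preferred
    )
    return response.choices[0].message.content.strip()[:1].upper()

# Usage with a made-up (non-PSITE) question:
# answer = ask_single_letter(
#     "Which flap is most commonly used for autologous breast reconstruction?",
#     {"A": "DIEP flap", "B": "Gracilis flap", "C": "Radial forearm flap"},
# )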
Results: Of 2097 multiple-choice questions, 1850 were included. ChatGPT answered 845 (45.7%) correctly, with its highest accuracy on breast/cosmetic topics (49.6%; P = .070). ChatGPT performed significantly better on questions requiring the lowest level of reasoning (knowledge, 55.1%) than on more complex questions such as analysis (41.4%) (P = .001). Multivariable analysis identified negative predictors of performance, including the hand/lower extremity content area (OR = 0.73, P = .038) and taxonomy levels beyond knowledge (P < .05). Performance on the 2023 examination (53.4%) corresponded to the 4th percentile among all plastic surgery residents.
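For illustration, analyses of this kind can be run on a per-question table with a binary correctness outcome and categorical predictors; the sketch below uses pandas, SciPy, and statsmodels, and the file name and column names are hypothetical rather than the authors' dataset.

# Illustrative analysis sketch; input file and column names are hypothetical.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy.stats import chi2_contingency

df = pd.read_csv("psite_chatgpt_responses.csv")  # one row per question, 'correct' in {0, 1}

# Chi-square test: does accuracy differ across taxonomy levels?
contingency = pd.crosstab(df["taxonomy"], df["correct"])
chi2, p_value, dof, _ = chi2_contingency(contingency)
print(f"chi2 = {chi2:.2f}, p = {p_value:.3f}")

# Multivariable logistic regression: odds of a correct answer by content area,
# taxonomy level, core competency, and test year.
model = smf.logit(
    "correct ~ C(content_area) + C(taxonomy) + C(competency) + C(year)",
    data=df,
).fit()
print(model.summary())
print(np.exp(model.params).round(2))  # exponentiated coefficients = odds ratios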
Conclusions: While ChatGPT has shown promise in other medical domains, our results indicate that it may not be a reliable source of information for plastic surgery-related questions or decision-making.