Exploring the Performance of ChatGPT in an Orthopaedic Setting and Its Potential Use as an Educational Tool.

JBJS Open Access (impact factor 2.3; JCR Q2, Orthopedics)

Publication date: 2024-11-26; eCollection date: 2024-10-01; DOI: 10.2106/JBJS.OA.24.00081

Arthur Drouaud, Carolina Stocchi, Justin Tang, Grant Gonsalves, Zoe Cheung, Jan Szatkowski, David Forsh

Volume 9, Issue 4. Full text (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11584220/pdf/

Citations: 0

Abstract

Introduction: We assessed the performance of ChatGPT-4 with vision (GPT-4V) in image interpretation, diagnosis formulation, and patient management. We aimed to shed light on its potential as an educational tool that walks medical students through real-life cases.

Methods: Ten of the most popular orthopaedic trauma cases from OrthoBullets were selected. GPT-4V interpreted the medical imaging and patient information for each case, provided diagnoses, and responded to the OrthoBullets questions. Four fellowship-trained orthopaedic trauma surgeons rated GPT-4V's responses on a 5-point Likert scale (strongly disagree to strongly agree). Each answer was assessed for alignment with current medical knowledge (accuracy), soundness of its reasoning (rationale), relevance to the specific case (relevance), and whether the surgeons would trust it (trustworthiness). Mean scores were calculated from the surgeon ratings.

Results: In total, 10 clinical cases comprising 97 questions were analyzed (10 imaging, 35 management, and 52 treatment). The surgeons assigned a mean overall rating of 3.46/5.00 to GPT-4V's imaging responses (accuracy 3.28, rationale 3.68, relevance 3.75, and trustworthiness 3.15). Management questions received a mean overall rating of 3.76 (accuracy 3.61, rationale 3.84, relevance 4.01, and trustworthiness 3.58), while treatment questions received a mean overall rating of 4.04 (accuracy 3.99, rationale 4.08, relevance 4.15, and trustworthiness 3.93).
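The reported figures can be checked for internal consistency. The sketch below is not the authors' analysis code; it assumes, as the numbers suggest but the abstract does not state, that each overall rating is the unweighted mean of the four criterion means. The dictionary layout, function names, and rounding tolerance are illustrative choices.

```python
from statistics import mean

# Mean ratings as reported in the Results paragraph (5-point Likert scale).
reported = {
    "imaging": {
        "n_questions": 10,
        "overall": 3.46,
        "criteria": {"accuracy": 3.28, "rationale": 3.68,
                     "relevance": 3.75, "trustworthiness": 3.15},
    },
    "management": {
        "n_questions": 35,
        "overall": 3.76,
        "criteria": {"accuracy": 3.61, "rationale": 3.84,
                     "relevance": 4.01, "trustworthiness": 3.58},
    },
    "treatment": {
        "n_questions": 52,
        "overall": 4.04,
        "criteria": {"accuracy": 3.99, "rationale": 4.08,
                     "relevance": 4.15, "trustworthiness": 3.93},
    },
}


def criteria_mean(category: str) -> float:
    """Unweighted mean of the four criterion scores for one question category."""
    return mean(reported[category]["criteria"].values())


def is_consistent(category: str, tol: float = 0.006) -> bool:
    """True if the criteria mean matches the reported overall score.

    The tolerance is an assumption: it allows for the published values
    being rounded to two decimal places.
    """
    return abs(criteria_mean(category) - reported[category]["overall"]) <= tol


# The three category counts should add up to the 97 questions reported.
total_questions = sum(c["n_questions"] for c in reported.values())

if __name__ == "__main__":
    print(f"total questions: {total_questions}")
    for name, data in reported.items():
        print(f"{name}: criteria mean {criteria_mean(name):.3f}, "
              f"reported overall {data['overall']:.2f}, "
              f"consistent: {is_consistent(name)}")
```

Under that assumption, all three overall ratings agree with their criterion means to within rounding, and the question counts add up to 97.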

Conclusion: This is the first study evaluating GPT-4V's imaging interpretation, personalized management, and treatment approaches as a medical educational tool. Surgeon ratings indicate fair overall agreement with the reasoning behind GPT-4V's decision-making. GPT-4V performed less favorably in imaging interpretation than in management and treatment. As a standalone tool for medical education, GPT-4V falls below the standards of our fellowship-trained orthopaedic trauma surgeons.

