Assessing Inter-rater Reliability of ChatGPT-4 and Orthopaedic Clinicians in Radiographic Fracture Classification

Aliyah N Walker, J B Smith, Samuel K Simister, Om Patel, Soham Choudhary, Michael Seidu, David Dallas-Orr, Shannon Tse, Hania Shahzad, Patrick Wise, Michelle Scott, Augustine M Saiz, Zachary C Lum

Journal of Orthopaedic Trauma, published online September 19, 2025. DOI: 10.1097/BOT.0000000000003079
Abstract
Objectives: To compare the inter-rater reliability of ChatGPT-4 with that of orthopaedic surgery attendings and residents in classifying fractures on upper extremity (UE) and lower extremity (LE) radiographs.
Methods: Eighty-four radiographs of various fracture patterns were collected from publicly available online repositories. The images were presented to ChatGPT-4 with a prompt asking it to identify the view, body location, fracture type, and AO/OTA fracture classification. Two orthopaedic surgery residents and two attending orthopaedic surgeons independently reviewed the same images and identified the same categories. Fleiss' Kappa values were calculated to determine inter-rater reliability (IRR) for the following comparisons: All Raters Combined; AI vs. Residents (AIR); AI vs. Attendings (AIA); and Attendings vs. Residents (AR).
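For readers less familiar with the statistic, the following is a minimal sketch of how a Fleiss' Kappa value can be computed for categorical fracture labels using the statsmodels implementation. The ratings, rater grouping, and label names below are illustrative assumptions only and do not reproduce the study's data or analysis code.

```python
# Illustrative sketch (made-up labels, not the study's data):
# Fleiss' Kappa for fracture-type ratings across three raters.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Rows = radiographs, columns = raters (e.g., ChatGPT-4, resident, attending).
ratings = np.array([
    ["transverse", "transverse", "transverse"],
    ["oblique",    "oblique",    "spiral"],
    ["comminuted", "comminuted", "comminuted"],
    ["spiral",     "oblique",    "oblique"],
    ["transverse", "transverse", "oblique"],
])

# aggregate_raters converts the (n_subjects, n_raters) label matrix into the
# (n_subjects, n_categories) count table that fleiss_kappa expects.
counts, categories = aggregate_raters(ratings)
kappa = fleiss_kappa(counts, method="fleiss")
print(f"Fleiss' kappa = {kappa:.3f}")
```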
Results: ChatGPT-4 achieved substantial to almost perfect agreement with clinicians on location (UE: κ = 0.655-0.708; LE: κ = 0.834-0.909) and fracture type (UE: κ = 0.546-0.563; LE: κ = 0.580-0.697). For view, ChatGPT-4 showed consistently fair agreement for both UE (κ = 0.370-0.404) and LE (κ = 0.309-0.390). ChatGPT-4 struggled most with AO/OTA classification, achieving only slight agreement for UE (κ = -0.062 to 0.159) and moderate agreement for LE (κ = 0.418-0.455). IRR for AIR was consistently lower than IRR for AR. For AR comparisons, almost perfect agreement was observed for location (UE: κ = 0.896; LE: κ = 0.912) and fracture type (UE: κ = 0.948; LE: κ = 0.859), while AO/OTA classification showed fair agreement for UE (κ = 0.257) and moderate agreement for LE (κ = 0.517). The p-values for all comparison groups were significant except for LE AO/OTA classification between AI and residents (p = 0.051).
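The descriptive terms in the Results (slight, fair, moderate, substantial, almost perfect) are consistent with the widely used Landis and Koch benchmarks for interpreting κ. A small helper, written under that assumption, makes the mapping explicit; the thresholds are the conventional ones, not values taken from the paper.

```python
def agreement_label(kappa: float) -> str:
    """Map a kappa value to the Landis-Koch descriptive benchmark
    (assumed here to be the convention behind the abstract's wording)."""
    if kappa < 0.00:
        return "poor"
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"

# Example: the attending-vs-resident UE fracture-type value reported above.
print(agreement_label(0.948))  # -> "almost perfect"
```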
Conclusions: Although ChatGPT-4 showed promise in classifying basic fracture features, it did not yet perform at a level comparable to experts, especially for more nuanced interpretations. These findings suggest that AI is more effective as an adjunct to the judgment of trained clinicians than as a replacement for it.
Journal Description
Journal of Orthopaedic Trauma is devoted exclusively to the diagnosis and management of hard and soft tissue trauma, including injuries to bone, muscle, ligament, and tendon, as well as spinal cord injuries. Under the guidance of a distinguished international board of editors, the journal provides the most current information on diagnostic techniques; new and improved surgical instruments and procedures; surgical implants and prosthetic devices; bioplastics and biometals; and physical therapy and rehabilitation.