Assessing Inter-rater Reliability of ChatGPT-4 and Orthopaedic Clinicians in Radiographic Fracture Classification

Aliyah N Walker, J B Smith, Samuel K Simister, Om Patel, Soham Choudhary, Michael Seidu, David Dallas-Orr, Shannon Tse, Hania Shahzad, Patrick Wise, Michelle Scott, Augustine M Saiz, Zachary C Lum

Journal of Orthopaedic Trauma, published online September 19, 2025. DOI: 10.1097/BOT.0000000000003079
Abstract
Objectives: To compare the inter-rater reliability of ChatGPT-4 with that of orthopaedic surgery attendings and residents in classifying fractures on upper extremity (UE) and lower extremity (LE) radiographs.
Methods: Eighty-four radiographs of various fracture patterns were collected from publicly available online repositories. The images were presented to ChatGPT-4 with a prompt asking it to identify the view, body location, fracture type, and AO/OTA fracture classification. Two orthopaedic surgery residents and two attending orthopaedic surgeons independently reviewed the same images and identified the same categories. Fleiss' Kappa values were calculated to determine inter-rater reliability (IRR) for the following comparisons: All Raters Combined; AI vs. Residents (AIR); AI vs. Attendings (AIA); and Attendings vs. Residents (AR).
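For readers less familiar with the statistic, the following is a minimal sketch of how a Fleiss' Kappa value can be computed for categorical fracture labels using the statsmodels implementation. The ratings, rater grouping, and label names below are illustrative assumptions only and do not reproduce the study's data or analysis code.

```python
# Illustrative sketch (made-up labels, not the study's data):
# Fleiss' Kappa for fracture-type ratings across three raters.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Rows = radiographs, columns = raters (e.g., ChatGPT-4, resident, attending).
ratings = np.array([
    ["transverse", "transverse", "transverse"],
    ["oblique",    "oblique",    "spiral"],
    ["comminuted", "comminuted", "comminuted"],
    ["spiral",     "oblique",    "oblique"],
    ["transverse", "transverse", "oblique"],
])

# aggregate_raters converts the (n_subjects, n_raters) label matrix into the
# (n_subjects, n_categories) count table that fleiss_kappa expects.
counts, categories = aggregate_raters(ratings)
kappa = fleiss_kappa(counts, method="fleiss")
print(f"Fleiss' kappa = {kappa:.3f}")
```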
Results: ChatGPT-4 achieved substantial to almost perfect agreement with clinicians on location (UE: κ = 0.655-0.708; LE: κ = 0.834-0.909) and fracture type (UE: κ = 0.546-0.563; LE: κ = 0.580-0.697). For view, ChatGPT-4 showed consistently fair agreement for both UE (κ = 0.370-0.404) and LE (κ = 0.309-0.390). ChatGPT-4 struggled most with AO/OTA classification, achieving only slight agreement for UE (κ = -0.062 to 0.159) and moderate agreement for LE (κ = 0.418-0.455). IRR for AIR was consistently lower than IRR for AR. For AR comparisons, almost perfect agreement was observed for location (UE: κ = 0.896; LE: κ = 0.912) and fracture type (UE: κ = 0.948; LE: κ = 0.859), while AO/OTA classification showed fair agreement for UE (κ = 0.257) and moderate agreement for LE (κ = 0.517). The p-values for all comparison groups were significant except for LE AO/OTA classification between AI and residents (p = 0.051).
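The descriptive terms in the Results (slight, fair, moderate, substantial, almost perfect) are consistent with the widely used Landis and Koch benchmarks for interpreting κ. A small helper, written under that assumption, makes the mapping explicit; the thresholds are the conventional ones, not values taken from the paper.

```python
def agreement_label(kappa: float) -> str:
    """Map a kappa value to the Landis-Koch descriptive benchmark
    (assumed here to be the convention behind the abstract's wording)."""
    if kappa < 0.00:
        return "poor"
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"

# Example: the attending-vs-resident UE fracture-type value reported above.
print(agreement_label(0.948))  # -> "almost perfect"
```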
Conclusions: Although ChatGPT-4 showed promise in classifying basic fracture features, it did not yet perform at a level comparable to experts, especially for more nuanced interpretations. These findings suggest that AI is more effective as an adjunct to the judgment of trained clinicians than as a replacement for it.
Journal Description
Journal of Orthopaedic Trauma is devoted exclusively to the diagnosis and management of hard and soft tissue trauma, including injuries to bone, muscle, ligament, and tendon, as well as spinal cord injuries. Under the guidance of a distinguished international board of editors, the journal provides the most current information on diagnostic techniques; new and improved surgical instruments and procedures; surgical implants and prosthetic devices; bioplastics and biometals; and physical therapy and rehabilitation.