Assessing Inter-rater Reliability of ChatGPT-4 and Orthopaedic Clinicians in Radiographic Fracture Classification.

IF 1.8 | Medicine (CAS Tier 3) | Q3 ORTHOPEDICS
Aliyah N Walker, J B Smith, Samuel K Simister, Om Patel, Soham Choudhary, Michael Seidu, David Dallas-Orr, Shannon Tse, Hania Shahzad, Patrick Wise, Michelle Scott, Augustine M Saiz, Zachary C Lum
{"title":"评估ChatGPT-4和骨科临床医生在影像学骨折分类中的可靠性。","authors":"Aliyah N Walker, J B Smith, Samuel K Simister, Om Patel, Soham Choudhary, Michael Seidu, David Dallas-Orr, Shannon Tse, Hania Shahzad, Patrick Wise, Michelle Scott, Augustine M Saiz, Zachary C Lum","doi":"10.1097/BOT.0000000000003079","DOIUrl":null,"url":null,"abstract":"<p><strong>Objectives: </strong>To assess the inter-rater reliability of ChatGPT-4 to that of orthopaedic surgery attendings and residents in classifying fractures on upper extremity (UE) and lower extremity (LE) radiographs.</p><p><strong>Methods: </strong>84 radiographs of various fracture patterns were collected from publicly available online repositories. These images were presented to ChatGPT-4 with the prompt asking it to identify the view, body location, fracture type, and AO/OTA fracture classification. Two orthopaedic surgery residents and two attending orthopaedic surgeons also independently reviewed the images and identified the same categories. Fleiss' Kappa values were calculated to determine inter-rater reliability (IRR) for the following: All Raters Combined, AI vs. Residents (AIR); AI vs. Attendings (AIA); Attendings vs. Residents (AR).</p><p><strong>Results: </strong>ChatGPT-4 achieved substantial to almost perfect agreement with clinicians on location (UE: κ = 0.655-0.708, LE: κ = 0.834-0.909) and fracture type (UE: κ = 0.546-0.563, LE: κ = 0.58-0.697). For view, ChatGPT-4 showed consistent fair agreement for both UE (κ = 0.370-0.404) and LE (κ = 0.309-0.390). ChatGPT-4 struggled the most with AO/OTA classification achieving slight agreement for UE (κ = -0.062-0.159) and moderate agreement for LE (κ = 0.418-0.455). IRR for AIR was consistently lower than IRR for AR. For AR comparisons, almost perfect agreement was observed for location (UE: κ = 0.896, LE: κ = 0.912) and fracture type (UE: κ = 0.948, LE: κ = 0.859), while AO/OTA classification showed fair agreement for UE (κ = 0.257) and moderate for LE (κ = 0.517). The p-values for all comparison groups were significant except for LE AO/OTA classification between AI and residents (p = 0.051).</p><p><strong>Conclusions: </strong>Although ChatcGPT-4 showed promise in classifying basic fracture features, it was not yet at a level comparable to experts, especially with more nuanced interpretations. These findings suggest that the use of AI is more effective as an adjunct to the judgment of trained clinicians rather than a replacement for it.</p>","PeriodicalId":16644,"journal":{"name":"Journal of Orthopaedic Trauma","volume":" ","pages":""},"PeriodicalIF":1.8000,"publicationDate":"2025-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Assessing Inter-rater Reliability of ChatGPT-4 and Orthopaedic Clinicians in Radiographic Fracture Classification.\",\"authors\":\"Aliyah N Walker, J B Smith, Samuel K Simister, Om Patel, Soham Choudhary, Michael Seidu, David Dallas-Orr, Shannon Tse, Hania Shahzad, Patrick Wise, Michelle Scott, Augustine M Saiz, Zachary C Lum\",\"doi\":\"10.1097/BOT.0000000000003079\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><strong>Objectives: </strong>To assess the inter-rater reliability of ChatGPT-4 to that of orthopaedic surgery attendings and residents in classifying fractures on upper extremity (UE) and lower extremity (LE) radiographs.</p><p><strong>Methods: </strong>84 radiographs of various fracture patterns were collected from publicly available online repositories. 
These images were presented to ChatGPT-4 with the prompt asking it to identify the view, body location, fracture type, and AO/OTA fracture classification. Two orthopaedic surgery residents and two attending orthopaedic surgeons also independently reviewed the images and identified the same categories. Fleiss' Kappa values were calculated to determine inter-rater reliability (IRR) for the following: All Raters Combined, AI vs. Residents (AIR); AI vs. Attendings (AIA); Attendings vs. Residents (AR).</p><p><strong>Results: </strong>ChatGPT-4 achieved substantial to almost perfect agreement with clinicians on location (UE: κ = 0.655-0.708, LE: κ = 0.834-0.909) and fracture type (UE: κ = 0.546-0.563, LE: κ = 0.58-0.697). For view, ChatGPT-4 showed consistent fair agreement for both UE (κ = 0.370-0.404) and LE (κ = 0.309-0.390). ChatGPT-4 struggled the most with AO/OTA classification achieving slight agreement for UE (κ = -0.062-0.159) and moderate agreement for LE (κ = 0.418-0.455). IRR for AIR was consistently lower than IRR for AR. For AR comparisons, almost perfect agreement was observed for location (UE: κ = 0.896, LE: κ = 0.912) and fracture type (UE: κ = 0.948, LE: κ = 0.859), while AO/OTA classification showed fair agreement for UE (κ = 0.257) and moderate for LE (κ = 0.517). The p-values for all comparison groups were significant except for LE AO/OTA classification between AI and residents (p = 0.051).</p><p><strong>Conclusions: </strong>Although ChatcGPT-4 showed promise in classifying basic fracture features, it was not yet at a level comparable to experts, especially with more nuanced interpretations. These findings suggest that the use of AI is more effective as an adjunct to the judgment of trained clinicians rather than a replacement for it.</p>\",\"PeriodicalId\":16644,\"journal\":{\"name\":\"Journal of Orthopaedic Trauma\",\"volume\":\" \",\"pages\":\"\"},\"PeriodicalIF\":1.8000,\"publicationDate\":\"2025-09-19\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Journal of Orthopaedic Trauma\",\"FirstCategoryId\":\"3\",\"ListUrlMain\":\"https://doi.org/10.1097/BOT.0000000000003079\",\"RegionNum\":3,\"RegionCategory\":\"医学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"ORTHOPEDICS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Orthopaedic Trauma","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1097/BOT.0000000000003079","RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"ORTHOPEDICS","Score":null,"Total":0}
Citations: 0

Abstract

Objectives: To compare the inter-rater reliability of ChatGPT-4 with that of orthopaedic surgery attendings and residents in classifying fractures on upper extremity (UE) and lower extremity (LE) radiographs.

Methods: Eighty-four radiographs of various fracture patterns were collected from publicly available online repositories. These images were presented to ChatGPT-4 with a prompt asking it to identify the view, body location, fracture type, and AO/OTA fracture classification. Two orthopaedic surgery residents and two attending orthopaedic surgeons independently reviewed the same images and assigned the same categories. Fleiss' kappa values were calculated to determine inter-rater reliability (IRR) for the following comparisons: All Raters Combined; AI vs. Residents (AIR); AI vs. Attendings (AIA); Attendings vs. Residents (AR).
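
The abstract does not include the authors' analysis code; the following is only a minimal sketch of how a Fleiss' kappa of this kind could be computed, assuming Python with statsmodels and an invented ratings matrix (radiographs as rows, raters as columns).

import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Hypothetical ratings for illustration only: rows = radiographs,
# columns = raters (e.g., AI, resident, attending), values = category codes.
ratings = np.array([
    [0, 0, 0],
    [1, 1, 2],
    [2, 2, 2],
    [0, 1, 0],
])

# aggregate_raters converts per-rater labels into per-item category counts,
# which is the table format fleiss_kappa expects.
counts, _categories = aggregate_raters(ratings)
print(f"Fleiss' kappa: {fleiss_kappa(counts, method='fleiss'):.3f}")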

Results: ChatGPT-4 achieved substantial to almost perfect agreement with clinicians on location (UE: κ = 0.655-0.708, LE: κ = 0.834-0.909) and fracture type (UE: κ = 0.546-0.563, LE: κ = 0.580-0.697). For view, ChatGPT-4 showed consistently fair agreement for both UE (κ = 0.370-0.404) and LE (κ = 0.309-0.390). ChatGPT-4 struggled most with AO/OTA classification, achieving only slight agreement for UE (κ = -0.062 to 0.159) and moderate agreement for LE (κ = 0.418-0.455). IRR for AIR was consistently lower than IRR for AR. For AR comparisons, almost perfect agreement was observed for location (UE: κ = 0.896, LE: κ = 0.912) and fracture type (UE: κ = 0.948, LE: κ = 0.859), while AO/OTA classification showed fair agreement for UE (κ = 0.257) and moderate agreement for LE (κ = 0.517). The p-values for all comparison groups were significant except for LE AO/OTA classification between AI and residents (p = 0.051).
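
The agreement labels above (slight, fair, moderate, substantial, almost perfect) appear to follow the widely used Landis and Koch cut-offs; the helper below is an illustrative mapping of a kappa value to those bands, not code from the paper.

def kappa_band(kappa: float) -> str:
    # Landis & Koch (1977) interpretation bands for kappa.
    if kappa < 0.00:
        return "poor"
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"

print(kappa_band(0.655))  # "substantial", e.g., the lower UE location kappa reported above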

Conclusions: Although ChatGPT-4 showed promise in classifying basic fracture features, it did not yet perform at a level comparable to experts, especially for more nuanced interpretations. These findings suggest that AI is more effective as an adjunct to the judgment of trained clinicians than as a replacement for it.

Source Journal
Journal of Orthopaedic Trauma (Medicine, Sport Sciences)
CiteScore: 3.90
Self-citation rate: 8.70%
Articles per year: 396
Review turnaround: 3-8 weeks
Journal description: Journal of Orthopaedic Trauma is devoted exclusively to the diagnosis and management of hard and soft tissue trauma, including injuries to bone, muscle, ligament, and tendons, as well as spinal cord injuries. Under the guidance of a distinguished international board of editors, the journal provides the most current information on diagnostic techniques, new and improved surgical instruments and procedures, surgical implants and prosthetic devices, bioplastics and biometals, and physical therapy and rehabilitation.