Comparing AAOS appropriate use criteria with ChatGPT-4o recommendations on treating distal radius fractures

IF 0.9 · Zone 4 (Medicine) · Q4 ORTHOPEDICS

Hand Surgery & Rehabilitation Pub Date : 2025-04-01 DOI:10.1016/j.hansur.2025.102122

Kareem S. Mohamed, Alexander Yu, Christoph A. Schroen, Akiro Duey, James Hong, Ryan Yu, Suhas Etigunta, Jamie Kator, Hannah S. Rhee, Michael R. Hausman

Hand Surgery & Rehabilitation, volume 44, issue 2, Article 102122. Publication type: Journal Article. Full text: https://www.sciencedirect.com/science/article/pii/S2468122925000441

Citations: 0

Abstract


Introduction

The American Academy of Orthopaedic Surgeons (AAOS) developed appropriate use criteria (AUC) to guide treatment decisions for distal radius fractures based on expert consensus. This study aims to evaluate the accuracy of Chat Generative Pre-trained Transformer-4o (ChatGPT-4o) by comparing its appropriateness scores for distal radius fracture treatment with those from the AUC.

Methods

The AUC patient scenarios were categorized by factors such as fracture type (AO/OTA classification), mechanism of injury, pre-injury activity level, patient health (ASA 1–4), and associated injuries. Treatment options included percutaneous pinning, spanning external fixation, volar locking plates, dorsal plates, and immobilization methods, among others. Orthopedic surgeons assigned appropriateness scores for each treatment (1–3 = “Rarely Appropriate,” 4–6 = “May Be Appropriate,” and 7–9 = “Appropriate”). ChatGPT-4o was prompted with the same patient scenarios and asked to assign scores. Differences between AAOS and ChatGPT-4o ratings were used to calculate mean error, mean absolute error, and mean squared error. Statistical significance was assessed using Spearman correlation, and appropriateness scores were grouped into categories to determine percentage overlap between the two sources.
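The comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' analysis code: the function names are hypothetical, the sample scores are made up, and the sign convention (ChatGPT-4o score minus AAOS score) is an assumption, since the abstract does not say which rating is subtracted from which.

```python
from statistics import mean


def category(score):
    """Map a 1-9 appropriateness score to its AUC category."""
    if not 1 <= score <= 9:
        raise ValueError("appropriateness scores run from 1 to 9")
    if score <= 3:
        return "Rarely Appropriate"
    if score <= 6:
        return "May Be Appropriate"
    return "Appropriate"


def error_metrics(aaos, gpt):
    """Mean error, mean absolute error, and mean squared error.

    Assumed sign convention: difference = ChatGPT-4o score - AAOS score.
    """
    diffs = [g - a for a, g in zip(aaos, gpt)]
    return {
        "mean_error": mean(diffs),
        "mean_absolute_error": mean(abs(d) for d in diffs),
        "mean_squared_error": mean(d * d for d in diffs),
    }


def percent_overlap(aaos, gpt):
    """Percentage of paired scores that fall in the same category."""
    same = sum(category(a) == category(g) for a, g in zip(aaos, gpt))
    return 100 * same / len(aaos)


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Made-up scores for one treatment across six scenarios (illustrative only).
aaos_scores = [8, 7, 3, 5, 2, 9]
gpt_scores = [7, 7, 4, 2, 2, 8]

metrics = error_metrics(aaos_scores, gpt_scores)
overlap = percent_overlap(aaos_scores, gpt_scores)
rho = spearman(aaos_scores, gpt_scores)
```

Comparing categories rather than raw scores means a one-point disagreement counts as a mismatch only when it crosses a category boundary (3 to 4, or 6 to 7), which is why the overlap percentage and the error metrics can tell different stories for the same treatment.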

Results

A total of 240 patient scenarios and 2160 paired treatment scores were analyzed. The mean error for treatment options ranged from 0.6 for volar locking plate to -2.9 for dorsal plating. Pearson correlation revealed significant positive associations for dorsal spanning bridge (0.43, P < 0.001) and spanning external fixation (0.4, P < 0.001). The percentage overlap between AAOS and ChatGPT-4o in the appropriateness categories varied, with 99.17% agreement for immobilization without reduction, 90.42% for volar locking plates, and only 15% for dorsal plating.
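The reported figures can be cross-checked with simple arithmetic. The sketch below assumes that each treatment's overlap percentage was computed over all 240 scenarios; the abstract does not state this, and the scenario counts it derives are inferred rather than reported.

```python
SCENARIOS = 240
PAIRED_SCORES = 2160

# 2160 paired scores over 240 scenarios implies nine scored treatments each.
treatments_per_scenario = PAIRED_SCORES // SCENARIOS


def matching_scenarios(percent, total=SCENARIOS):
    """Nearest whole number of scenarios corresponding to a percentage."""
    return round(percent / 100 * total)


# Overlap percentages as reported in the Results.
reported_overlap = {
    "immobilization without reduction": 99.17,
    "volar locking plate": 90.42,
    "dorsal plating": 15.0,
}

# Inferred counts of scenarios where both sources chose the same category.
matching_counts = {
    name: matching_scenarios(pct) for name, pct in reported_overlap.items()
}
```

Under that assumption the percentages correspond to 238, 217, and 36 matching scenarios out of 240, so ChatGPT-4o and the AUC disagreed on the category for dorsal plating in 204 scenarios.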

Conclusion

ChatGPT-4o does not consistently align with the appropriate use criteria in determining appropriate management of distal radius fractures. While there was moderate concordance in certain treatments, ChatGPT-4o tended to favor more conservative approaches, raising concerns about the reliability of AI-generated recommendations for medical advice and clinical decision-making.


Source journal: Hand Surgery & Rehabilitation (Medicine-Surgery)

CiteScore: 1.70

Self-citation rate: 27.30%

Review time: 49 days

Journal description: As the official publication of the French, Belgian and Swiss Societies for Surgery of the Hand, as well as of the French Society of Rehabilitation of the Hand & Upper Limb, "Hand Surgery and Rehabilitation" (formerly named "Chirurgie de la Main") publishes original articles, literature reviews, technical notes, and clinical cases. It is indexed in the main international databases (including Medline). Initially a platform for French-speaking hand surgeons, the journal will now publish its articles in English to disseminate its authors' scientific findings more widely. The journal also includes a biannual supplement in French, the monograph of the French Society for Surgery of the Hand, where comprehensive reviews in the fields of hand, peripheral nerve and upper limb surgery are presented.

Translated from the French description: Official journal of the Société française de chirurgie de la main, the Société française de Rééducation de la main (SFRM-GEMMSOR), the Société suisse de chirurgie de la main, and the Belgian Hand Group, and indexed in the major international databases (Medline, Embase, Pascal, Scopus), Hand Surgery and Rehabilitation (formerly titled Chirurgie de la main) publishes original articles, literature reviews, technical notes, and clinical cases. Initially a French-language platform for the specialty, the journal is now moving toward English to become a scientific and educational reference for the specialty in France and Europe. With 6 publications in English per year, the journal also includes a biannual supplement, the GEM monograph, which presents comprehensive reviews in French in the fields of hand, peripheral nerve, and upper limb surgery.