Patrick P Nian, Amith Umesh, Shae K Simpson, Olivia C Tracey, Erikson Nichols, Stephanie Logterman, Shevaun M Doyle, Jessica H Heyer
{"title":"儿童肱骨髁上和股骨骨干骨折:Chat生成预训练Transformer和谷歌Gemini推荐与美国骨科学会临床实践指南的比较分析","authors":"Patrick P Nian, Amith Umesh, Shae K Simpson, Olivia C Tracey, Erikson Nichols, Stephanie Logterman, Shevaun M Doyle, Jessica H Heyer","doi":"10.1097/BPO.0000000000002890","DOIUrl":null,"url":null,"abstract":"<p><strong>Objective: </strong>Artificial intelligence (AI) chatbots, including chat generative pretrained transformer (ChatGPT) and Google Gemini, have significantly increased access to medical information. However, in pediatric orthopaedics, no study has evaluated the accuracy of AI chatbots compared with evidence-based recommendations, including the American Academy of Orthopaedic Surgeons clinical practice guidelines (AAOS CPGs). The aims of this study were to compare responses by ChatGPT-4.0, ChatGPT-3.5, and Google Gemini with AAOS CPG recommendations on pediatric supracondylar humerus and diaphyseal femur fractures regarding accuracy, supplementary and incomplete response patterns, and readability.</p><p><strong>Methods: </strong>ChatGPT-4.0, ChatGPT-3.5, and Google Gemini were prompted by questions created from 13 evidence-based recommendations (6 from the 2011 AAOS CPG on pediatric supracondylar humerus fractures; 7 from the 2020 AAOS CPG on pediatric diaphyseal femur fractures). Responses were anonymized and independently evaluated by 2 pediatric orthopaedic attending surgeons. Supplementary responses were, in addition, evaluated on whether no, some, or many modifications were necessary. Readability metrics (response length, Flesch-Kincaid reading level, Flesch Reading Ease, Gunning Fog Index) were compared. Cohen Kappa interrater reliability (κ) was calculated. χ2 analyses and single-factor analysis of variance were utilized to compare categorical and continuous variables, respectively. Statistical significance was set with P <0.05.</p><p><strong>Results: </strong>ChatGPT-4.0, ChatGPT-3.5, and Google Gemini were accurate in 11/13, 9/13, and 11/13, supplementary in 13/13, 11/13, and 13/13, and incomplete in 3/13, 4/13, and 4/13 recommendations, respectively. Of 37 supplementary responses, 17 (45.9%), 19 (51.4%), and 1 (2.7%) required no, some, and many modifications, respectively. There were no significant differences in accuracy (P = 0.533), supplementary responses (P = 0.121), necessary modifications (P = 0.580), and incomplete responses (P = 0.881). Overall κ was moderate at 0.55. ChatGPT-3.5 provided shorter responses (P = 0.002), but Google Gemini was more readable in terms of Flesch-Kincaid Grade Level (P = 0.002), Flesch Reading Ease (P < 0.001), and Gunning Fog Index (P = 0.021).</p><p><strong>Conclusions: </strong>While AI chatbots provided responses with reasonable accuracy, most supplemental information required modification and had complex readability. 
Improvements are necessary before AI chatbots can be reliably used for patient education.</p><p><strong>Level of evidence: </strong>Level IV.</p>","PeriodicalId":16945,"journal":{"name":"Journal of Pediatric Orthopaedics","volume":" ","pages":""},"PeriodicalIF":1.4000,"publicationDate":"2025-01-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Pediatric Supracondylar Humerus and Diaphyseal Femur Fractures: A Comparative Analysis of Chat Generative Pretrained Transformer and Google Gemini Recommendations Versus American Academy of Orthopaedic Surgeons Clinical Practice Guidelines.\",\"authors\":\"Patrick P Nian, Amith Umesh, Shae K Simpson, Olivia C Tracey, Erikson Nichols, Stephanie Logterman, Shevaun M Doyle, Jessica H Heyer\",\"doi\":\"10.1097/BPO.0000000000002890\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><strong>Objective: </strong>Artificial intelligence (AI) chatbots, including chat generative pretrained transformer (ChatGPT) and Google Gemini, have significantly increased access to medical information. However, in pediatric orthopaedics, no study has evaluated the accuracy of AI chatbots compared with evidence-based recommendations, including the American Academy of Orthopaedic Surgeons clinical practice guidelines (AAOS CPGs). The aims of this study were to compare responses by ChatGPT-4.0, ChatGPT-3.5, and Google Gemini with AAOS CPG recommendations on pediatric supracondylar humerus and diaphyseal femur fractures regarding accuracy, supplementary and incomplete response patterns, and readability.</p><p><strong>Methods: </strong>ChatGPT-4.0, ChatGPT-3.5, and Google Gemini were prompted by questions created from 13 evidence-based recommendations (6 from the 2011 AAOS CPG on pediatric supracondylar humerus fractures; 7 from the 2020 AAOS CPG on pediatric diaphyseal femur fractures). Responses were anonymized and independently evaluated by 2 pediatric orthopaedic attending surgeons. Supplementary responses were, in addition, evaluated on whether no, some, or many modifications were necessary. Readability metrics (response length, Flesch-Kincaid reading level, Flesch Reading Ease, Gunning Fog Index) were compared. Cohen Kappa interrater reliability (κ) was calculated. χ2 analyses and single-factor analysis of variance were utilized to compare categorical and continuous variables, respectively. Statistical significance was set with P <0.05.</p><p><strong>Results: </strong>ChatGPT-4.0, ChatGPT-3.5, and Google Gemini were accurate in 11/13, 9/13, and 11/13, supplementary in 13/13, 11/13, and 13/13, and incomplete in 3/13, 4/13, and 4/13 recommendations, respectively. Of 37 supplementary responses, 17 (45.9%), 19 (51.4%), and 1 (2.7%) required no, some, and many modifications, respectively. There were no significant differences in accuracy (P = 0.533), supplementary responses (P = 0.121), necessary modifications (P = 0.580), and incomplete responses (P = 0.881). Overall κ was moderate at 0.55. ChatGPT-3.5 provided shorter responses (P = 0.002), but Google Gemini was more readable in terms of Flesch-Kincaid Grade Level (P = 0.002), Flesch Reading Ease (P < 0.001), and Gunning Fog Index (P = 0.021).</p><p><strong>Conclusions: </strong>While AI chatbots provided responses with reasonable accuracy, most supplemental information required modification and had complex readability. 
Improvements are necessary before AI chatbots can be reliably used for patient education.</p><p><strong>Level of evidence: </strong>Level IV.</p>\",\"PeriodicalId\":16945,\"journal\":{\"name\":\"Journal of Pediatric Orthopaedics\",\"volume\":\" \",\"pages\":\"\"},\"PeriodicalIF\":1.4000,\"publicationDate\":\"2025-01-14\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Journal of Pediatric Orthopaedics\",\"FirstCategoryId\":\"3\",\"ListUrlMain\":\"https://doi.org/10.1097/BPO.0000000000002890\",\"RegionNum\":3,\"RegionCategory\":\"医学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"ORTHOPEDICS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Pediatric Orthopaedics","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1097/BPO.0000000000002890","RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"ORTHOPEDICS","Score":null,"Total":0}
Pediatric Supracondylar Humerus and Diaphyseal Femur Fractures: A Comparative Analysis of Chat Generative Pretrained Transformer and Google Gemini Recommendations Versus American Academy of Orthopaedic Surgeons Clinical Practice Guidelines.
Objective: Artificial intelligence (AI) chatbots, including chat generative pretrained transformer (ChatGPT) and Google Gemini, have significantly increased access to medical information. However, in pediatric orthopaedics, no study has evaluated the accuracy of AI chatbots compared with evidence-based recommendations, including the American Academy of Orthopaedic Surgeons clinical practice guidelines (AAOS CPGs). The aims of this study were to compare responses by ChatGPT-4.0, ChatGPT-3.5, and Google Gemini with AAOS CPG recommendations on pediatric supracondylar humerus and diaphyseal femur fractures regarding accuracy, supplementary and incomplete response patterns, and readability.
Methods: ChatGPT-4.0, ChatGPT-3.5, and Google Gemini were prompted with questions created from 13 evidence-based recommendations (6 from the 2011 AAOS CPG on pediatric supracondylar humerus fractures; 7 from the 2020 AAOS CPG on pediatric diaphyseal femur fractures). Responses were anonymized and independently evaluated by 2 pediatric orthopaedic attending surgeons. Supplementary responses were additionally graded on whether no, some, or many modifications were necessary. Readability metrics (response length, Flesch-Kincaid grade level, Flesch Reading Ease, Gunning Fog Index) were compared. Cohen kappa interrater reliability (κ) was calculated. χ2 analyses and single-factor analysis of variance were used to compare categorical and continuous variables, respectively. Statistical significance was set at P < 0.05.
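For readers who want to reproduce this style of evaluation, the sketch below shows how such a pipeline might be assembled in Python. It is not the authors' code; the package choices (textstat, scikit-learn, scipy) and the placeholder responses and grades are assumptions for illustration only.

```python
# Minimal sketch of the evaluation pipeline described above (not the authors'
# actual code). Assumes the third-party packages textstat, scikit-learn, and
# scipy; the chatbot responses and surgeon grades here are placeholders.
import textstat
from scipy.stats import f_oneway
from sklearn.metrics import cohen_kappa_score

# Placeholder responses standing in for answers to the 13 CPG-derived questions.
responses = {
    "ChatGPT-4.0": [
        "Closed reduction and percutaneous pinning is commonly recommended.",
        "Nonoperative treatment is an option for very young children.",
    ],
    "ChatGPT-3.5": [
        "Surgical fixation may be indicated for displaced fractures.",
        "Spica casting is often used in patients under 5 years of age.",
    ],
    "Google Gemini": [
        "Urgent operative treatment is advised for pulseless limbs.",
        "Flexible intramedullary nailing suits school-aged children.",
    ],
}

# Readability metrics per response: response length, Flesch-Kincaid grade
# level, Flesch Reading Ease, and Gunning Fog Index.
def readability(text: str) -> dict:
    return {
        "words": len(text.split()),
        "fk_grade": textstat.flesch_kincaid_grade(text),
        "flesch_ease": textstat.flesch_reading_ease(text),
        "gunning_fog": textstat.gunning_fog(text),
    }

scores = {bot: [readability(r) for r in texts] for bot, texts in responses.items()}

# Interrater reliability: two surgeons independently grade each response
# (1 = accurate, 0 = inaccurate); Cohen kappa measures agreement beyond chance.
surgeon_a = [1, 1, 0, 1, 1, 0]
surgeon_b = [1, 0, 0, 1, 1, 1]
print("kappa:", cohen_kappa_score(surgeon_a, surgeon_b))

# Continuous readability outcomes are compared across the three chatbots with
# single-factor ANOVA; P < 0.05 is taken as statistically significant.
groups = [[s["fk_grade"] for s in scores[bot]] for bot in responses]
f_stat, p_value = f_oneway(*groups)
print(f"ANOVA on FK grade: F = {f_stat:.2f}, P = {p_value:.3f}")
```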
Results: ChatGPT-4.0, ChatGPT-3.5, and Google Gemini were accurate in 11/13, 9/13, and 11/13, supplementary in 13/13, 11/13, and 13/13, and incomplete in 3/13, 4/13, and 4/13 recommendations, respectively. Of 37 supplementary responses, 17 (45.9%), 19 (51.4%), and 1 (2.7%) required no, some, and many modifications, respectively. There were no significant differences in accuracy (P = 0.533), supplementary responses (P = 0.121), necessary modifications (P = 0.580), and incomplete responses (P = 0.881). Overall κ was moderate at 0.55. ChatGPT-3.5 provided shorter responses (P = 0.002), but Google Gemini was more readable in terms of Flesch-Kincaid Grade Level (P = 0.002), Flesch Reading Ease (P < 0.001), and Gunning Fog Index (P = 0.021).
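For illustration, the accuracy comparison can be reproduced from the counts reported above with a standard chi-squared test of independence. The sketch below uses scipy; the contingency-table layout (accurate vs. inaccurate out of 13 recommendations per chatbot) is our assumption about how the reported figures were compared.

```python
# Sketch of the categorical comparison, using the accuracy counts reported
# above (accurate vs. inaccurate out of 13 recommendations per chatbot).
from scipy.stats import chi2_contingency

#          accurate  inaccurate
table = [
    [11, 2],   # ChatGPT-4.0
    [9, 4],    # ChatGPT-3.5
    [11, 2],   # Google Gemini
]
chi2, p, dof, _ = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, df = {dof}, P = {p:.3f}")  # P ≈ 0.533, as reported
```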
Conclusions: While AI chatbots provided responses with reasonable accuracy, most supplementary information required modification, and responses were written at a complex reading level. Improvements are necessary before AI chatbots can be reliably used for patient education.

Level of evidence: Level IV.
Journal introduction:
Journal of Pediatric Orthopaedics is a leading journal that focuses specifically on traumatic injuries, giving you hands-on coverage of a fast-growing field. You'll get articles that cover everything from the nature of injury to the effects of new drug therapies, and from recommendations for more effective surgical approaches to the latest laboratory findings.