Pediatric Supracondylar Humerus and Diaphyseal Femur Fractures: A Comparative Analysis of Chat Generative Pretrained Transformer and Google Gemini Recommendations Versus American Academy of Orthopaedic Surgeons Clinical Practice Guidelines

IF 1.4 · JCR Q3 (ORTHOPEDICS) · CAS Medicine Tier 3
Patrick P Nian, Amith Umesh, Shae K Simpson, Olivia C Tracey, Erikson Nichols, Stephanie Logterman, Shevaun M Doyle, Jessica H Heyer
{"title":"儿童肱骨髁上和股骨骨干骨折:Chat生成预训练Transformer和谷歌Gemini推荐与美国骨科学会临床实践指南的比较分析","authors":"Patrick P Nian, Amith Umesh, Shae K Simpson, Olivia C Tracey, Erikson Nichols, Stephanie Logterman, Shevaun M Doyle, Jessica H Heyer","doi":"10.1097/BPO.0000000000002890","DOIUrl":null,"url":null,"abstract":"<p><strong>Objective: </strong>Artificial intelligence (AI) chatbots, including chat generative pretrained transformer (ChatGPT) and Google Gemini, have significantly increased access to medical information. However, in pediatric orthopaedics, no study has evaluated the accuracy of AI chatbots compared with evidence-based recommendations, including the American Academy of Orthopaedic Surgeons clinical practice guidelines (AAOS CPGs). The aims of this study were to compare responses by ChatGPT-4.0, ChatGPT-3.5, and Google Gemini with AAOS CPG recommendations on pediatric supracondylar humerus and diaphyseal femur fractures regarding accuracy, supplementary and incomplete response patterns, and readability.</p><p><strong>Methods: </strong>ChatGPT-4.0, ChatGPT-3.5, and Google Gemini were prompted by questions created from 13 evidence-based recommendations (6 from the 2011 AAOS CPG on pediatric supracondylar humerus fractures; 7 from the 2020 AAOS CPG on pediatric diaphyseal femur fractures). Responses were anonymized and independently evaluated by 2 pediatric orthopaedic attending surgeons. Supplementary responses were, in addition, evaluated on whether no, some, or many modifications were necessary. Readability metrics (response length, Flesch-Kincaid reading level, Flesch Reading Ease, Gunning Fog Index) were compared. Cohen Kappa interrater reliability (κ) was calculated. χ2 analyses and single-factor analysis of variance were utilized to compare categorical and continuous variables, respectively. Statistical significance was set with P <0.05.</p><p><strong>Results: </strong>ChatGPT-4.0, ChatGPT-3.5, and Google Gemini were accurate in 11/13, 9/13, and 11/13, supplementary in 13/13, 11/13, and 13/13, and incomplete in 3/13, 4/13, and 4/13 recommendations, respectively. Of 37 supplementary responses, 17 (45.9%), 19 (51.4%), and 1 (2.7%) required no, some, and many modifications, respectively. There were no significant differences in accuracy (P = 0.533), supplementary responses (P = 0.121), necessary modifications (P = 0.580), and incomplete responses (P = 0.881). Overall κ was moderate at 0.55. ChatGPT-3.5 provided shorter responses (P = 0.002), but Google Gemini was more readable in terms of Flesch-Kincaid Grade Level (P = 0.002), Flesch Reading Ease (P < 0.001), and Gunning Fog Index (P = 0.021).</p><p><strong>Conclusions: </strong>While AI chatbots provided responses with reasonable accuracy, most supplemental information required modification and had complex readability. 
Improvements are necessary before AI chatbots can be reliably used for patient education.</p><p><strong>Level of evidence: </strong>Level IV.</p>","PeriodicalId":16945,"journal":{"name":"Journal of Pediatric Orthopaedics","volume":" ","pages":""},"PeriodicalIF":1.4000,"publicationDate":"2025-01-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Pediatric Supracondylar Humerus and Diaphyseal Femur Fractures: A Comparative Analysis of Chat Generative Pretrained Transformer and Google Gemini Recommendations Versus American Academy of Orthopaedic Surgeons Clinical Practice Guidelines.\",\"authors\":\"Patrick P Nian, Amith Umesh, Shae K Simpson, Olivia C Tracey, Erikson Nichols, Stephanie Logterman, Shevaun M Doyle, Jessica H Heyer\",\"doi\":\"10.1097/BPO.0000000000002890\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><strong>Objective: </strong>Artificial intelligence (AI) chatbots, including chat generative pretrained transformer (ChatGPT) and Google Gemini, have significantly increased access to medical information. However, in pediatric orthopaedics, no study has evaluated the accuracy of AI chatbots compared with evidence-based recommendations, including the American Academy of Orthopaedic Surgeons clinical practice guidelines (AAOS CPGs). The aims of this study were to compare responses by ChatGPT-4.0, ChatGPT-3.5, and Google Gemini with AAOS CPG recommendations on pediatric supracondylar humerus and diaphyseal femur fractures regarding accuracy, supplementary and incomplete response patterns, and readability.</p><p><strong>Methods: </strong>ChatGPT-4.0, ChatGPT-3.5, and Google Gemini were prompted by questions created from 13 evidence-based recommendations (6 from the 2011 AAOS CPG on pediatric supracondylar humerus fractures; 7 from the 2020 AAOS CPG on pediatric diaphyseal femur fractures). Responses were anonymized and independently evaluated by 2 pediatric orthopaedic attending surgeons. Supplementary responses were, in addition, evaluated on whether no, some, or many modifications were necessary. Readability metrics (response length, Flesch-Kincaid reading level, Flesch Reading Ease, Gunning Fog Index) were compared. Cohen Kappa interrater reliability (κ) was calculated. χ2 analyses and single-factor analysis of variance were utilized to compare categorical and continuous variables, respectively. Statistical significance was set with P <0.05.</p><p><strong>Results: </strong>ChatGPT-4.0, ChatGPT-3.5, and Google Gemini were accurate in 11/13, 9/13, and 11/13, supplementary in 13/13, 11/13, and 13/13, and incomplete in 3/13, 4/13, and 4/13 recommendations, respectively. Of 37 supplementary responses, 17 (45.9%), 19 (51.4%), and 1 (2.7%) required no, some, and many modifications, respectively. There were no significant differences in accuracy (P = 0.533), supplementary responses (P = 0.121), necessary modifications (P = 0.580), and incomplete responses (P = 0.881). Overall κ was moderate at 0.55. ChatGPT-3.5 provided shorter responses (P = 0.002), but Google Gemini was more readable in terms of Flesch-Kincaid Grade Level (P = 0.002), Flesch Reading Ease (P < 0.001), and Gunning Fog Index (P = 0.021).</p><p><strong>Conclusions: </strong>While AI chatbots provided responses with reasonable accuracy, most supplemental information required modification and had complex readability. 
Improvements are necessary before AI chatbots can be reliably used for patient education.</p><p><strong>Level of evidence: </strong>Level IV.</p>\",\"PeriodicalId\":16945,\"journal\":{\"name\":\"Journal of Pediatric Orthopaedics\",\"volume\":\" \",\"pages\":\"\"},\"PeriodicalIF\":1.4000,\"publicationDate\":\"2025-01-14\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Journal of Pediatric Orthopaedics\",\"FirstCategoryId\":\"3\",\"ListUrlMain\":\"https://doi.org/10.1097/BPO.0000000000002890\",\"RegionNum\":3,\"RegionCategory\":\"医学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"ORTHOPEDICS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Pediatric Orthopaedics","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1097/BPO.0000000000002890","RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"ORTHOPEDICS","Score":null,"Total":0}
Citations: 0

Abstract

Objective: Artificial intelligence (AI) chatbots, including chat generative pretrained transformer (ChatGPT) and Google Gemini, have significantly increased access to medical information. However, in pediatric orthopaedics, no study has evaluated the accuracy of AI chatbots compared with evidence-based recommendations, including the American Academy of Orthopaedic Surgeons clinical practice guidelines (AAOS CPGs). The aims of this study were to compare responses by ChatGPT-4.0, ChatGPT-3.5, and Google Gemini with AAOS CPG recommendations on pediatric supracondylar humerus and diaphyseal femur fractures regarding accuracy, supplementary and incomplete response patterns, and readability.

Methods: ChatGPT-4.0, ChatGPT-3.5, and Google Gemini were prompted with questions created from 13 evidence-based recommendations (6 from the 2011 AAOS CPG on pediatric supracondylar humerus fractures; 7 from the 2020 AAOS CPG on pediatric diaphyseal femur fractures). Responses were anonymized and independently evaluated by 2 pediatric orthopaedic attending surgeons. Supplementary responses were additionally evaluated on whether no, some, or many modifications were necessary. Readability metrics (response length, Flesch-Kincaid Grade Level, Flesch Reading Ease, Gunning Fog Index) were compared. Cohen kappa interrater reliability (κ) was calculated. χ2 analyses and single-factor analysis of variance were used to compare categorical and continuous variables, respectively. Statistical significance was set at P < 0.05.
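
The abstract names the readability formulas and the agreement statistic but not the software used to compute them. The sketch below is a minimal, illustrative Python implementation (not the authors' pipeline) of the three cited readability metrics in their standard published forms, using a crude vowel-group syllable heuristic (so values will differ somewhat from dedicated tools), plus Cohen's kappa for two raters; the sample sentence and rater vectors are hypothetical.

```python
# Illustrative only: the abstract does not describe the authors' tooling.
import re

def _syllables(word: str) -> int:
    # Rough heuristic: count runs of consecutive vowels; at least 1 per word.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def readability(text: str) -> dict:
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    n_words = max(1, len(words))
    n_syll = sum(_syllables(w) for w in words)
    complex_words = sum(1 for w in words if _syllables(w) >= 3)
    wps = n_words / sentences   # mean words per sentence
    spw = n_syll / n_words      # mean syllables per word
    return {
        # Standard published formulas for each metric:
        "flesch_reading_ease": 206.835 - 1.015 * wps - 84.6 * spw,
        "flesch_kincaid_grade": 0.39 * wps + 11.8 * spw - 15.59,
        "gunning_fog": 0.4 * (wps + 100 * complex_words / n_words),
    }

def cohen_kappa(r1: list, r2: list) -> float:
    # Observed agreement corrected for agreement expected by chance.
    n = len(r1)
    po = sum(a == b for a, b in zip(r1, r2)) / n
    pe = sum((r1.count(c) / n) * (r2.count(c) / n) for c in set(r1) | set(r2))
    return (po - pe) / (1 - pe)

if __name__ == "__main__":
    sample = "Closed reduction and pin fixation are recommended for displaced fractures."
    print(readability(sample))
    # Hypothetical example: two raters grading 13 responses accurate (1) or not (0).
    print(cohen_kappa([1] * 11 + [0] * 2, [1] * 10 + [0] * 3))
```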

Results: ChatGPT-4.0, ChatGPT-3.5, and Google Gemini were accurate in 11/13, 9/13, and 11/13, supplementary in 13/13, 11/13, and 13/13, and incomplete in 3/13, 4/13, and 4/13 recommendations, respectively. Of 37 supplementary responses, 17 (45.9%), 19 (51.4%), and 1 (2.7%) required no, some, and many modifications, respectively. There were no significant differences in accuracy (P = 0.533), supplementary responses (P = 0.121), necessary modifications (P = 0.580), and incomplete responses (P = 0.881). Overall κ was moderate at 0.55. ChatGPT-3.5 provided shorter responses (P = 0.002), but Google Gemini was more readable in terms of Flesch-Kincaid Grade Level (P = 0.002), Flesch Reading Ease (P < 0.001), and Gunning Fog Index (P = 0.021).
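
As a sanity check on the categorical comparisons, the accuracy result can be reconstructed from the reported counts alone: a chi-square test on the 2×3 table of accurate versus inaccurate responses (computed here with SciPy, which the abstract does not name) returns P ≈ 0.533, matching the reported value.

```python
# Illustrative reconstruction from the reported counts; not the authors' code.
from scipy.stats import chi2_contingency

accurate = [11, 9, 11]                     # ChatGPT-4.0, ChatGPT-3.5, Gemini
inaccurate = [13 - a for a in accurate]    # complements out of 13 recommendations
chi2, p, dof, _ = chi2_contingency([accurate, inaccurate])
print(f"chi2 = {chi2:.3f}, dof = {dof}, p = {p:.3f}")  # p ≈ 0.533
```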

Conclusions: While AI chatbots provided responses with reasonable accuracy, most supplementary information required modification and was written at a complex reading level. Improvements are necessary before AI chatbots can be reliably used for patient education.

Level of evidence: Level IV.

Source journal: Journal of Pediatric Orthopaedics
CiteScore: 3.30
Self-citation rate: 17.60%
Annual articles: 512
Review turnaround: 6 months
Journal introduction: Journal of Pediatric Orthopaedics is a leading journal that focuses specifically on traumatic injuries, giving you hands-on coverage of a fast-growing field. You'll get articles that cover everything from the nature of injury to the effects of new drug therapies, and from recommendations for more effective surgical approaches to the latest laboratory findings.