Patrick P Nian, Amith Umesh, Shae K Simpson, Olivia C Tracey, Erikson Nichols, Stephanie Logterman, Shevaun M Doyle, Jessica H Heyer
{"title":"儿童肱骨髁上和股骨骨干骨折:Chat生成预训练Transformer和谷歌Gemini推荐与美国骨科学会临床实践指南的比较分析","authors":"Patrick P Nian, Amith Umesh, Shae K Simpson, Olivia C Tracey, Erikson Nichols, Stephanie Logterman, Shevaun M Doyle, Jessica H Heyer","doi":"10.1097/BPO.0000000000002890","DOIUrl":null,"url":null,"abstract":"<p><strong>Objective: </strong>Artificial intelligence (AI) chatbots, including chat generative pretrained transformer (ChatGPT) and Google Gemini, have significantly increased access to medical information. However, in pediatric orthopaedics, no study has evaluated the accuracy of AI chatbots compared with evidence-based recommendations, including the American Academy of Orthopaedic Surgeons clinical practice guidelines (AAOS CPGs). The aims of this study were to compare responses by ChatGPT-4.0, ChatGPT-3.5, and Google Gemini with AAOS CPG recommendations on pediatric supracondylar humerus and diaphyseal femur fractures regarding accuracy, supplementary and incomplete response patterns, and readability.</p><p><strong>Methods: </strong>ChatGPT-4.0, ChatGPT-3.5, and Google Gemini were prompted by questions created from 13 evidence-based recommendations (6 from the 2011 AAOS CPG on pediatric supracondylar humerus fractures; 7 from the 2020 AAOS CPG on pediatric diaphyseal femur fractures). Responses were anonymized and independently evaluated by 2 pediatric orthopaedic attending surgeons. Supplementary responses were, in addition, evaluated on whether no, some, or many modifications were necessary. Readability metrics (response length, Flesch-Kincaid reading level, Flesch Reading Ease, Gunning Fog Index) were compared. Cohen Kappa interrater reliability (κ) was calculated. χ2 analyses and single-factor analysis of variance were utilized to compare categorical and continuous variables, respectively. Statistical significance was set with P <0.05.</p><p><strong>Results: </strong>ChatGPT-4.0, ChatGPT-3.5, and Google Gemini were accurate in 11/13, 9/13, and 11/13, supplementary in 13/13, 11/13, and 13/13, and incomplete in 3/13, 4/13, and 4/13 recommendations, respectively. Of 37 supplementary responses, 17 (45.9%), 19 (51.4%), and 1 (2.7%) required no, some, and many modifications, respectively. There were no significant differences in accuracy (P = 0.533), supplementary responses (P = 0.121), necessary modifications (P = 0.580), and incomplete responses (P = 0.881). Overall κ was moderate at 0.55. ChatGPT-3.5 provided shorter responses (P = 0.002), but Google Gemini was more readable in terms of Flesch-Kincaid Grade Level (P = 0.002), Flesch Reading Ease (P < 0.001), and Gunning Fog Index (P = 0.021).</p><p><strong>Conclusions: </strong>While AI chatbots provided responses with reasonable accuracy, most supplemental information required modification and had complex readability. 
Improvements are necessary before AI chatbots can be reliably used for patient education.</p><p><strong>Level of evidence: </strong>Level IV.</p>","PeriodicalId":16945,"journal":{"name":"Journal of Pediatric Orthopaedics","volume":" ","pages":""},"PeriodicalIF":1.4000,"publicationDate":"2025-01-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Pediatric Supracondylar Humerus and Diaphyseal Femur Fractures: A Comparative Analysis of Chat Generative Pretrained Transformer and Google Gemini Recommendations Versus American Academy of Orthopaedic Surgeons Clinical Practice Guidelines.\",\"authors\":\"Patrick P Nian, Amith Umesh, Shae K Simpson, Olivia C Tracey, Erikson Nichols, Stephanie Logterman, Shevaun M Doyle, Jessica H Heyer\",\"doi\":\"10.1097/BPO.0000000000002890\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><strong>Objective: </strong>Artificial intelligence (AI) chatbots, including chat generative pretrained transformer (ChatGPT) and Google Gemini, have significantly increased access to medical information. However, in pediatric orthopaedics, no study has evaluated the accuracy of AI chatbots compared with evidence-based recommendations, including the American Academy of Orthopaedic Surgeons clinical practice guidelines (AAOS CPGs). The aims of this study were to compare responses by ChatGPT-4.0, ChatGPT-3.5, and Google Gemini with AAOS CPG recommendations on pediatric supracondylar humerus and diaphyseal femur fractures regarding accuracy, supplementary and incomplete response patterns, and readability.</p><p><strong>Methods: </strong>ChatGPT-4.0, ChatGPT-3.5, and Google Gemini were prompted by questions created from 13 evidence-based recommendations (6 from the 2011 AAOS CPG on pediatric supracondylar humerus fractures; 7 from the 2020 AAOS CPG on pediatric diaphyseal femur fractures). Responses were anonymized and independently evaluated by 2 pediatric orthopaedic attending surgeons. Supplementary responses were, in addition, evaluated on whether no, some, or many modifications were necessary. Readability metrics (response length, Flesch-Kincaid reading level, Flesch Reading Ease, Gunning Fog Index) were compared. Cohen Kappa interrater reliability (κ) was calculated. χ2 analyses and single-factor analysis of variance were utilized to compare categorical and continuous variables, respectively. Statistical significance was set with P <0.05.</p><p><strong>Results: </strong>ChatGPT-4.0, ChatGPT-3.5, and Google Gemini were accurate in 11/13, 9/13, and 11/13, supplementary in 13/13, 11/13, and 13/13, and incomplete in 3/13, 4/13, and 4/13 recommendations, respectively. Of 37 supplementary responses, 17 (45.9%), 19 (51.4%), and 1 (2.7%) required no, some, and many modifications, respectively. There were no significant differences in accuracy (P = 0.533), supplementary responses (P = 0.121), necessary modifications (P = 0.580), and incomplete responses (P = 0.881). Overall κ was moderate at 0.55. ChatGPT-3.5 provided shorter responses (P = 0.002), but Google Gemini was more readable in terms of Flesch-Kincaid Grade Level (P = 0.002), Flesch Reading Ease (P < 0.001), and Gunning Fog Index (P = 0.021).</p><p><strong>Conclusions: </strong>While AI chatbots provided responses with reasonable accuracy, most supplemental information required modification and had complex readability. 
Improvements are necessary before AI chatbots can be reliably used for patient education.</p><p><strong>Level of evidence: </strong>Level IV.</p>\",\"PeriodicalId\":16945,\"journal\":{\"name\":\"Journal of Pediatric Orthopaedics\",\"volume\":\" \",\"pages\":\"\"},\"PeriodicalIF\":1.4000,\"publicationDate\":\"2025-01-14\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Journal of Pediatric Orthopaedics\",\"FirstCategoryId\":\"3\",\"ListUrlMain\":\"https://doi.org/10.1097/BPO.0000000000002890\",\"RegionNum\":3,\"RegionCategory\":\"医学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"ORTHOPEDICS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Pediatric Orthopaedics","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1097/BPO.0000000000002890","RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"ORTHOPEDICS","Score":null,"Total":0}
Pediatric Supracondylar Humerus and Diaphyseal Femur Fractures: A Comparative Analysis of Chat Generative Pretrained Transformer and Google Gemini Recommendations Versus American Academy of Orthopaedic Surgeons Clinical Practice Guidelines.
Objective: Artificial intelligence (AI) chatbots, including chat generative pretrained transformer (ChatGPT) and Google Gemini, have significantly increased access to medical information. However, in pediatric orthopaedics, no study has evaluated the accuracy of AI chatbots compared with evidence-based recommendations, including the American Academy of Orthopaedic Surgeons clinical practice guidelines (AAOS CPGs). The aims of this study were to compare responses by ChatGPT-4.0, ChatGPT-3.5, and Google Gemini with AAOS CPG recommendations on pediatric supracondylar humerus and diaphyseal femur fractures regarding accuracy, supplementary and incomplete response patterns, and readability.
Methods: ChatGPT-4.0, ChatGPT-3.5, and Google Gemini were prompted with questions created from 13 evidence-based recommendations (6 from the 2011 AAOS CPG on pediatric supracondylar humerus fractures; 7 from the 2020 AAOS CPG on pediatric diaphyseal femur fractures). Responses were anonymized and independently evaluated by 2 pediatric orthopaedic attending surgeons. Supplementary responses were additionally graded on whether no, some, or many modifications were necessary. Readability metrics (response length, Flesch-Kincaid grade level, Flesch Reading Ease, Gunning Fog Index) were compared. Cohen kappa interrater reliability (κ) was calculated. χ2 analyses and single-factor analysis of variance were used to compare categorical and continuous variables, respectively. Statistical significance was set at P < 0.05.
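For readers who want to reproduce this style of evaluation, the sketch below shows how such a pipeline might be assembled in Python. It is not the authors' code; the package choices (textstat, scikit-learn, scipy) and the placeholder responses and grades are assumptions for illustration only.

```python
# Minimal sketch of the evaluation pipeline described above (not the authors'
# actual code). Assumes the third-party packages textstat, scikit-learn, and
# scipy; the chatbot responses and surgeon grades here are placeholders.
import textstat
from scipy.stats import f_oneway
from sklearn.metrics import cohen_kappa_score

# Placeholder responses standing in for answers to the 13 CPG-derived questions.
responses = {
    "ChatGPT-4.0": [
        "Closed reduction and percutaneous pinning is commonly recommended.",
        "Nonoperative treatment is an option for very young children.",
    ],
    "ChatGPT-3.5": [
        "Surgical fixation may be indicated for displaced fractures.",
        "Spica casting is often used in patients under 5 years of age.",
    ],
    "Google Gemini": [
        "Urgent operative treatment is advised for pulseless limbs.",
        "Flexible intramedullary nailing suits school-aged children.",
    ],
}

# Readability metrics per response: response length, Flesch-Kincaid grade
# level, Flesch Reading Ease, and Gunning Fog Index.
def readability(text: str) -> dict:
    return {
        "words": len(text.split()),
        "fk_grade": textstat.flesch_kincaid_grade(text),
        "flesch_ease": textstat.flesch_reading_ease(text),
        "gunning_fog": textstat.gunning_fog(text),
    }

scores = {bot: [readability(r) for r in texts] for bot, texts in responses.items()}

# Interrater reliability: two surgeons independently grade each response
# (1 = accurate, 0 = inaccurate); Cohen kappa measures agreement beyond chance.
surgeon_a = [1, 1, 0, 1, 1, 0]
surgeon_b = [1, 0, 0, 1, 1, 1]
print("kappa:", cohen_kappa_score(surgeon_a, surgeon_b))

# Continuous readability outcomes are compared across the three chatbots with
# single-factor ANOVA; P < 0.05 is taken as statistically significant.
groups = [[s["fk_grade"] for s in scores[bot]] for bot in responses]
f_stat, p_value = f_oneway(*groups)
print(f"ANOVA on FK grade: F = {f_stat:.2f}, P = {p_value:.3f}")
```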
Results: ChatGPT-4.0, ChatGPT-3.5, and Google Gemini were accurate in 11/13, 9/13, and 11/13, supplementary in 13/13, 11/13, and 13/13, and incomplete in 3/13, 4/13, and 4/13 recommendations, respectively. Of 37 supplementary responses, 17 (45.9%), 19 (51.4%), and 1 (2.7%) required no, some, and many modifications, respectively. There were no significant differences in accuracy (P = 0.533), supplementary responses (P = 0.121), necessary modifications (P = 0.580), and incomplete responses (P = 0.881). Overall κ was moderate at 0.55. ChatGPT-3.5 provided shorter responses (P = 0.002), but Google Gemini was more readable in terms of Flesch-Kincaid Grade Level (P = 0.002), Flesch Reading Ease (P < 0.001), and Gunning Fog Index (P = 0.021).
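For illustration, the accuracy comparison can be reproduced from the counts reported above with a standard chi-squared test of independence. The sketch below uses scipy; the contingency-table layout (accurate vs. inaccurate out of 13 recommendations per chatbot) is our assumption about how the reported figures were compared.

```python
# Sketch of the categorical comparison, using the accuracy counts reported
# above (accurate vs. inaccurate out of 13 recommendations per chatbot).
from scipy.stats import chi2_contingency

#          accurate  inaccurate
table = [
    [11, 2],   # ChatGPT-4.0
    [9, 4],    # ChatGPT-3.5
    [11, 2],   # Google Gemini
]
chi2, p, dof, _ = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, df = {dof}, P = {p:.3f}")  # P ≈ 0.533, as reported
```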
Conclusions: While AI chatbots provided responses with reasonable accuracy, most supplementary information required modification, and responses were written at a complex reading level. Improvements are necessary before AI chatbots can be reliably used for patient education.

Level of evidence: Level IV.
Journal introduction:
Journal of Pediatric Orthopaedics is a leading journal that focuses specifically on traumatic injuries, giving you hands-on coverage of a fast-growing field. You'll get articles that cover everything from the nature of injury to the effects of new drug therapies, and from recommendations for more effective surgical approaches to the latest laboratory findings.