ChatGPT and Google Gemini are Clinically Inadequate in Providing Recommendations on Management of Developmental Dysplasia of the Hip Compared to American Academy of Orthopaedic Surgeons Clinical Practice Guidelines.

Patrick P Nian, Amith Umesh, Ruth H Jones, Akshitha Adhiyaman, Christopher J Williams, Christine M Goodbody, Jessica H Heyer, Shevaun M Doyle

Background: Large language models, including Chat Generative Pre-trained Transformer (ChatGPT) and Google Gemini, have accelerated public access to information, but their accuracy on medical questions remains unknown. In pediatric orthopaedics, no study has used board-certified expert opinion to evaluate the accuracy of artificial intelligence (AI) chatbots against evidence-based recommendations such as the American Academy of Orthopaedic Surgeons clinical practice guidelines (AAOS CPGs). The aims of this study were to compare responses by ChatGPT-4.0, ChatGPT-3.5, and Google Gemini with AAOS CPG recommendations on developmental dysplasia of the hip (DDH) regarding accuracy, supplementary and incomplete response patterns, and readability.

Methods: ChatGPT-4.0, ChatGPT-3.5, and Google Gemini were prompted with questions derived from 9 evidence-based recommendations in the 2022 AAOS CPG on DDH. Answers were obtained on July 1, 2024. Responses were anonymized and independently evaluated by two pediatric orthopaedic attending surgeons. Supplementary responses were additionally rated on whether no, some, or many modifications were necessary. Readability metrics (response length, Flesch-Kincaid reading level, Flesch Reading Ease, Gunning Fog Index) were compared. Cohen's Kappa inter-rater reliability (κ) was calculated. Chi-square analyses and single-factor analysis of variance were used to compare categorical and continuous variables, respectively. Statistical significance was set at P < 0.05.
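As a rough illustration of the three readability metrics named above (not the study's actual scoring pipeline), the standard formulas can be sketched in Python. The `readability` helper and its naive vowel-group syllable counter are illustrative assumptions; published implementations use more careful syllable and sentence detection:

```python
import re

def count_syllables(word: str) -> int:
    # Naive heuristic: count groups of consecutive vowels;
    # every word is counted as having at least one syllable.
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def readability(text: str) -> dict:
    # Split into sentences and words with simple regexes.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    # "Complex" words (Gunning Fog) are those with 3+ syllables.
    complex_words = sum(1 for w in words if count_syllables(w) >= 3)
    wps = len(words) / len(sentences)   # average words per sentence
    spw = syllables / len(words)        # average syllables per word
    return {
        # Flesch-Kincaid Grade Level
        "flesch_kincaid_grade": 0.39 * wps + 11.8 * spw - 15.59,
        # Flesch Reading Ease (higher = easier)
        "flesch_reading_ease": 206.835 - 1.015 * wps - 84.6 * spw,
        # Gunning Fog Index
        "gunning_fog": 0.4 * (wps + 100 * complex_words / len(words)),
    }
```

A short, monosyllabic sentence such as "The cat sat on the mat." scores very easy on all three scales, consistent with the intent of the formulas.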

Results: ChatGPT-4.0, ChatGPT-3.5, and Google Gemini were accurate in 5/9, 6/9, and 6/9 recommendations, supplementary in 8/9, 7/9, and 9/9, and incomplete in 7/9, 6/9, and 7/9, respectively. Of 24 supplementary responses, 5 (20.8%), 16 (66.7%), and 3 (12.5%) required no, some, and many modifications, respectively. There were no significant differences in accuracy (P = 0.853), supplementary responses (P = 0.325), necessary modifications (P = 0.661), or incomplete responses (P = 0.825). κ was highest for accuracy at 0.17. Google Gemini was significantly more readable by Flesch-Kincaid reading level, Flesch Reading Ease, and Gunning Fog Index (all P < 0.05).
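The inter-rater agreement statistic reported above, Cohen's Kappa, corrects observed agreement between the two raters for the agreement expected by chance: κ = (p_o − p_e) / (1 − p_e). A minimal sketch (illustrative ratings only, not the study's data):

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's Kappa for two raters over the same items."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    # Observed agreement: fraction of items rated identically.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(rater_a) | set(rater_b)
    p_e = sum((counts_a[l] / n) * (counts_b[l] / n) for l in labels)
    return (p_o - p_e) / (1 - p_e)
```

With two raters agreeing on half the items while both use each label half the time, κ = 0 (agreement is no better than chance); values near 0.17, as reported here for accuracy, are conventionally interpreted as only slight agreement.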

Conclusions: In the setting of DDH, AI chatbots demonstrated limited accuracy, frequent supplementary and incomplete responses, and poor readability. Pediatric orthopaedic surgeons can counsel patients and their families to set appropriate expectations about the utility of these novel tools.

Key concepts: (1) Responses by ChatGPT-4.0, ChatGPT-3.5, and Google Gemini were inadequately accurate, frequently provided supplementary information that required modifications, and frequently lacked essential details from the AAOS CPGs on DDH. (2) Accurate, supplementary, and incomplete response patterns were not significantly different among the three chatbots. (3) Google Gemini provided the most readable responses among the three chatbots. (4) Pediatric orthopaedic surgeons can play a role in counseling patients and their families on the limited utility of AI chatbots for patient education purposes.

Level of evidence: IV.
