Yibin B Zhang, Fielding S Fischer, Matthew V Abola, Daniel A Osei, Scott W Wolfe, Troy B Amen
{"title":"ChatGPT和Gemini的建议是否符合手部和上肢手术的既定指南?","authors":"Yibin B Zhang, Fielding S Fischer, Matthew V Abola, Daniel A Osei, Scott W Wolfe, Troy B Amen","doi":"10.1177/15589447251371089","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>The use of large language models (LLMs) such as ChatGPT and Gemini in clinical settings has surged, presenting potential benefits in reducing administrative workload and enhancing patient communication. However, concerns about the clinical accuracy of these tools persist. This study evaluated the concordance of ChatGPT and Gemini's recommendations with American Academy of Orthopedic Surgeons (AAOS) clinical practice guidelines (CPGs) for carpal tunnel syndrome, distal radius fractures, and glenohumeral joint osteoarthritis.</p><p><strong>Methods: </strong>ChatGPT (version 4o) and Gemini (version 1.5 Flash) were queried using structured text-based prompts aligned with AAOS CPGs. The LLMs' outputs were analyzed by blinded reviewers to determine concordance with the guidelines. Concordance rates were compared across models, topics, and guideline strength using descriptive statistics and McNemar's test. The transparency of responses, including source citation, was also assessed.</p><p><strong>Results: </strong>A total of 174 recommendations were generated, with an overall concordance rate of 62.1%. When comparing concordance rates between LLMs, there was no statistically significant difference between ChatGPT and Gemini (66.7% vs 57.5%, <i>P</i> = .131). Concordance varied by topic and guideline strength, with ChatGPT performing best for moderately supported guidelines. Both models demonstrated low citation transparency. Gemini provided sources for 39.1% of recommendations, significantly more than ChatGPT's 3.5% (<i>P</i> < .0001).</p><p><strong>Conclusions: </strong>Despite modest concordance rates, both models exhibited significant limitations, including variability across topics and guideline strengths, as well as insufficient citation transparency. These findings highlight the challenges in integrating LLMs into clinical practice and emphasize the need for further refinement and evaluation before adoption in hand surgery.</p>","PeriodicalId":12902,"journal":{"name":"HAND","volume":" ","pages":"15589447251371089"},"PeriodicalIF":1.8000,"publicationDate":"2025-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12446276/pdf/","citationCount":"0","resultStr":"{\"title\":\"Do ChatGPT and Gemini's Recommendations Align With Established Guidelines for Hand and Upper Extremity Surgery?\",\"authors\":\"Yibin B Zhang, Fielding S Fischer, Matthew V Abola, Daniel A Osei, Scott W Wolfe, Troy B Amen\",\"doi\":\"10.1177/15589447251371089\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><strong>Background: </strong>The use of large language models (LLMs) such as ChatGPT and Gemini in clinical settings has surged, presenting potential benefits in reducing administrative workload and enhancing patient communication. However, concerns about the clinical accuracy of these tools persist. 
This study evaluated the concordance of ChatGPT and Gemini's recommendations with American Academy of Orthopedic Surgeons (AAOS) clinical practice guidelines (CPGs) for carpal tunnel syndrome, distal radius fractures, and glenohumeral joint osteoarthritis.</p><p><strong>Methods: </strong>ChatGPT (version 4o) and Gemini (version 1.5 Flash) were queried using structured text-based prompts aligned with AAOS CPGs. The LLMs' outputs were analyzed by blinded reviewers to determine concordance with the guidelines. Concordance rates were compared across models, topics, and guideline strength using descriptive statistics and McNemar's test. The transparency of responses, including source citation, was also assessed.</p><p><strong>Results: </strong>A total of 174 recommendations were generated, with an overall concordance rate of 62.1%. When comparing concordance rates between LLMs, there was no statistically significant difference between ChatGPT and Gemini (66.7% vs 57.5%, <i>P</i> = .131). Concordance varied by topic and guideline strength, with ChatGPT performing best for moderately supported guidelines. Both models demonstrated low citation transparency. Gemini provided sources for 39.1% of recommendations, significantly more than ChatGPT's 3.5% (<i>P</i> < .0001).</p><p><strong>Conclusions: </strong>Despite modest concordance rates, both models exhibited significant limitations, including variability across topics and guideline strengths, as well as insufficient citation transparency. These findings highlight the challenges in integrating LLMs into clinical practice and emphasize the need for further refinement and evaluation before adoption in hand surgery.</p>\",\"PeriodicalId\":12902,\"journal\":{\"name\":\"HAND\",\"volume\":\" \",\"pages\":\"15589447251371089\"},\"PeriodicalIF\":1.8000,\"publicationDate\":\"2025-09-18\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12446276/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"HAND\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1177/15589447251371089\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"ORTHOPEDICS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"HAND","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1177/15589447251371089","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"ORTHOPEDICS","Score":null,"Total":0}
Citations: 0
Do ChatGPT and Gemini's Recommendations Align With Established Guidelines for Hand and Upper Extremity Surgery?
Background: The use of large language models (LLMs) such as ChatGPT and Gemini in clinical settings has surged, presenting potential benefits in reducing administrative workload and enhancing patient communication. However, concerns about the clinical accuracy of these tools persist. This study evaluated the concordance of ChatGPT's and Gemini's recommendations with American Academy of Orthopaedic Surgeons (AAOS) clinical practice guidelines (CPGs) for carpal tunnel syndrome, distal radius fractures, and glenohumeral joint osteoarthritis.
Methods: ChatGPT (GPT-4o) and Gemini (Gemini 1.5 Flash) were queried using structured text-based prompts aligned with AAOS CPGs. The LLMs' outputs were analyzed by blinded reviewers to determine concordance with the guidelines. Concordance rates were compared across models, topics, and guideline strength using descriptive statistics and McNemar's test. The transparency of responses, including source citation, was also assessed.
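The paper does not publish its querying scripts, but the paired-prompt workflow it describes is straightforward to reproduce. The sketch below uses the public openai and google-generativeai Python clients to send the same guideline-aligned prompt to both models and save the raw outputs for later blinded review; the prompt wording, file layout, and variable names are illustrative assumptions, not the study's actual materials.

```python
# Hypothetical sketch of the paired-query workflow described in Methods.
# Assumes the public `openai` and `google-generativeai` client libraries;
# prompt text and file layout are illustrative, not the study's materials.
import json

from openai import OpenAI
import google.generativeai as genai

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment
genai.configure(api_key="YOUR_GOOGLE_API_KEY")
gemini = genai.GenerativeModel("gemini-1.5-flash")

# One structured prompt per AAOS CPG statement (wording is a placeholder).
prompts = [
    "What does current evidence recommend regarding the use of "
    "wrist splinting for mild carpal tunnel syndrome?",
    # ... one prompt per guideline statement, across the three topics
]

records = []
for prompt in prompts:
    gpt_reply = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content

    gemini_reply = gemini.generate_content(prompt).text

    # Store raw outputs only; concordance is judged later by blinded reviewers.
    records.append({"prompt": prompt, "gpt4o": gpt_reply, "gemini": gemini_reply})

with open("llm_responses.json", "w") as f:
    json.dump(records, f, indent=2)
```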
Results: A total of 174 recommendations were generated, with an overall concordance rate of 62.1%. When comparing concordance rates between LLMs, there was no statistically significant difference between ChatGPT and Gemini (66.7% vs 57.5%, P = .131). Concordance varied by topic and guideline strength, with ChatGPT performing best for moderately supported guidelines. Both models demonstrated low citation transparency. Gemini provided sources for 39.1% of recommendations, significantly more than ChatGPT's 3.5% (P < .0001).
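McNemar's test compares two paired proportions by examining only the discordant pairs, here the guideline statements for which exactly one model was concordant. Assuming 87 paired recommendations per model (174 total), the published rates correspond to 58/87 (66.7%) and 50/87 (57.5%). The 2x2 cell counts in the sketch below are hypothetical values consistent with those marginals; because the paper does not report the discordant split, the computed p-value only approximates the published P = .131.

```python
# Worked McNemar example. Only the totals come from the paper (174
# recommendations, i.e. 87 paired prompts; 66.7% vs 57.5% concordant).
# The discordant split (15 vs 7) is an assumption, so the resulting
# p-value only approximates the published P = .131.
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

#                  Gemini concordant  Gemini discordant
table = np.array([[43, 15],    # ChatGPT concordant
                  [ 7, 22]])   # ChatGPT discordant

assert table[0].sum() == 58        # ChatGPT: 58/87 ≈ 66.7%
assert table[:, 0].sum() == 50     # Gemini:  50/87 ≈ 57.5%

result = mcnemar(table, exact=True)  # exact binomial test on 15 vs 7
print(f"p = {result.pvalue:.3f}")    # ≈ 0.134 with these assumed cells
```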
Conclusions: Despite modest concordance rates, both models exhibited significant limitations, including variability across topics and guideline strengths, as well as insufficient citation transparency. These findings highlight the challenges in integrating LLMs into clinical practice and emphasize the need for further refinement and evaluation before adoption in hand surgery.
About the journal:
HAND is the official peer-reviewed journal of the American Association for Hand Surgery, featuring articles by clinicians worldwide presenting current research and clinical work in the field of hand surgery. It covers all aspects of hand and upper extremity surgery as well as the postoperative care and rehabilitation of the hand.