Yi-Lin Wang, Li-Chao Tian, Jing-Yuan Meng, Jie-Chao Zhang, Zhi-Xing Nie, Wen-Rui Wei, Dao-Fang Ding, Xiao-Ye Tang, Qian Zhang, Yong He
{"title":"大语言模型在肩袖损伤患者教育和临床决策支持中的评估:一项两阶段基准研究。","authors":"Yi-Lin Wang, Li-Chao Tian, Jing-Yuan Meng, Jie-Chao Zhang, Zhi-Xing Nie, Wen-Rui Wei, Dao-Fang Ding, Xiao-Ye Tang, Qian Zhang, Yong He","doi":"10.1186/s12911-025-03105-5","DOIUrl":null,"url":null,"abstract":"<p><strong>Objective: </strong>This study evaluates the accuracy of ChatGPT-4o, ChatGPT-o1, Gemini, and ERNIE Bot in answering rotator cuff injury questions and responding to patients. Results show Gemini excels in accuracy, while ChatGPT-4o performs better in patient interactions.</p><p><strong>Methods: </strong>Phase 1: Four LLM chatbots answered physician test questions on rotator cuff injuries, interacting with patients and students. Their performance was assessed for accuracy and clarity across 108 multiple-choice and 20 clinical questions. Phase 2: Twenty patients questioned the top two chatbots (ChatGPT-4o, Gemini), with responses rated for satisfaction and readability. Three physicians evaluated accuracy, usefulness, safety, and completeness using a 5-point Likert scale. Statistical analyses and plotting used IBM SPSS 29.0.1.0 and Prism 10; Friedman test compared evaluation and readability scores among chatbots with Bonferroni-corrected pairwise comparisons, Mann-Whitney U test compared ChatGPT-4o versus Gemini; statistical significance at p < 0.05.</p><p><strong>Results: </strong>Gemini achieved the highest average accuracy. In the second part, Gemini showed the highest proficiency in answering rotator cuff injury-related queries (accuracy: 4.70; completeness: 4.72; readability: 4.70; usefulness: 4.61; safety: 4.70, post hoc Dunnett test, p < 0.05). Additionally, 20 rotator cuff injury patients questioned the top two models from Phase 1 (ChatGPT-4o and Gemini). ChatGPT-4o had the highest reading difficulty score (14.22, post hoc Dunnett test, p < 0.05), suggesting a middle school reading level or above. Statistical analysis showed significant differences in patient satisfaction (4.52 vs. 3.76, p < 0.001) and readability (4.35 vs. 4.23). Orthopedic surgeons rated ChatGPT-4o higher in accuracy, completeness, readability, usefulness, and safety (all p < 0.05), outperforming Gemini in all aspects.</p><p><strong>Conclusion: </strong>The study found that LLMs, particularly ChatGPT-4o and Gemini, excelled in understanding rotator cuff injury-related knowledge and responding to patients, showing strong potential for further development.</p>","PeriodicalId":9340,"journal":{"name":"BMC Medical Informatics and Decision Making","volume":"25 1","pages":"289"},"PeriodicalIF":3.8000,"publicationDate":"2025-08-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12323112/pdf/","citationCount":"0","resultStr":"{\"title\":\"Evaluation of large language models in patient education and clinical decision support for rotator cuff injury: a two-phase benchmarking study.\",\"authors\":\"Yi-Lin Wang, Li-Chao Tian, Jing-Yuan Meng, Jie-Chao Zhang, Zhi-Xing Nie, Wen-Rui Wei, Dao-Fang Ding, Xiao-Ye Tang, Qian Zhang, Yong He\",\"doi\":\"10.1186/s12911-025-03105-5\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><strong>Objective: </strong>This study evaluates the accuracy of ChatGPT-4o, ChatGPT-o1, Gemini, and ERNIE Bot in answering rotator cuff injury questions and responding to patients. 
Results show Gemini excels in accuracy, while ChatGPT-4o performs better in patient interactions.</p><p><strong>Methods: </strong>Phase 1: Four LLM chatbots answered physician test questions on rotator cuff injuries, interacting with patients and students. Their performance was assessed for accuracy and clarity across 108 multiple-choice and 20 clinical questions. Phase 2: Twenty patients questioned the top two chatbots (ChatGPT-4o, Gemini), with responses rated for satisfaction and readability. Three physicians evaluated accuracy, usefulness, safety, and completeness using a 5-point Likert scale. Statistical analyses and plotting used IBM SPSS 29.0.1.0 and Prism 10; Friedman test compared evaluation and readability scores among chatbots with Bonferroni-corrected pairwise comparisons, Mann-Whitney U test compared ChatGPT-4o versus Gemini; statistical significance at p < 0.05.</p><p><strong>Results: </strong>Gemini achieved the highest average accuracy. In the second part, Gemini showed the highest proficiency in answering rotator cuff injury-related queries (accuracy: 4.70; completeness: 4.72; readability: 4.70; usefulness: 4.61; safety: 4.70, post hoc Dunnett test, p < 0.05). Additionally, 20 rotator cuff injury patients questioned the top two models from Phase 1 (ChatGPT-4o and Gemini). ChatGPT-4o had the highest reading difficulty score (14.22, post hoc Dunnett test, p < 0.05), suggesting a middle school reading level or above. Statistical analysis showed significant differences in patient satisfaction (4.52 vs. 3.76, p < 0.001) and readability (4.35 vs. 4.23). Orthopedic surgeons rated ChatGPT-4o higher in accuracy, completeness, readability, usefulness, and safety (all p < 0.05), outperforming Gemini in all aspects.</p><p><strong>Conclusion: </strong>The study found that LLMs, particularly ChatGPT-4o and Gemini, excelled in understanding rotator cuff injury-related knowledge and responding to patients, showing strong potential for further development.</p>\",\"PeriodicalId\":9340,\"journal\":{\"name\":\"BMC Medical Informatics and Decision Making\",\"volume\":\"25 1\",\"pages\":\"289\"},\"PeriodicalIF\":3.8000,\"publicationDate\":\"2025-08-04\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12323112/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"BMC Medical Informatics and Decision Making\",\"FirstCategoryId\":\"3\",\"ListUrlMain\":\"https://doi.org/10.1186/s12911-025-03105-5\",\"RegionNum\":3,\"RegionCategory\":\"医学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"MEDICAL INFORMATICS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"BMC Medical Informatics and Decision Making","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1186/s12911-025-03105-5","RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"MEDICAL INFORMATICS","Score":null,"Total":0}
Evaluation of large language models in patient education and clinical decision support for rotator cuff injury: a two-phase benchmarking study.
Objective: This study evaluates the accuracy of ChatGPT-4o, ChatGPT-o1, Gemini, and ERNIE Bot in answering rotator cuff injury questions and responding to patients. Results show Gemini excels in accuracy, while ChatGPT-4o performs better in patient interactions.
Methods: Phase 1: Four LLM chatbots answered physician test questions on rotator cuff injuries and interacted with patients and students. Their performance was assessed for accuracy and clarity across 108 multiple-choice and 20 clinical questions. Phase 2: Twenty patients questioned the top two chatbots from Phase 1 (ChatGPT-4o and Gemini), and the responses were rated for satisfaction and readability. Three physicians evaluated accuracy, usefulness, safety, and completeness on a 5-point Likert scale. Statistical analyses and plotting used IBM SPSS 29.0.1.0 and Prism 10. The Friedman test compared evaluation and readability scores among the chatbots, with Bonferroni-corrected pairwise comparisons; the Mann-Whitney U test compared ChatGPT-4o with Gemini. Statistical significance was set at p < 0.05.
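The comparisons described above map onto standard non-parametric routines. The following is a minimal Python sketch, assuming SciPy in place of SPSS/Prism; the Likert ratings, the sample sizes, and the use of Wilcoxon signed-rank tests for the Bonferroni-corrected pairwise follow-ups are illustrative assumptions rather than details taken from the study.

```python
# Minimal sketch of the statistical workflow described in the Methods, using SciPy
# instead of SPSS/Prism. All ratings below are hypothetical illustrations.
from itertools import combinations
from scipy.stats import friedmanchisquare, wilcoxon, mannwhitneyu

# Hypothetical physician ratings (5-point Likert) for the same set of questions
scores = {
    "ChatGPT-4o": [5, 4, 5, 4, 5, 4, 5, 5, 4, 5],
    "ChatGPT-o1": [4, 4, 4, 3, 4, 4, 5, 4, 3, 4],
    "Gemini":     [5, 5, 4, 5, 5, 5, 4, 5, 4, 5],
    "ERNIE Bot":  [3, 4, 3, 3, 4, 3, 4, 3, 3, 4],
}

# Friedman test: omnibus comparison of related samples (same questions, four chatbots)
stat, p = friedmanchisquare(*scores.values())
print(f"Friedman chi-square = {stat:.2f}, p = {p:.4f}")

# Bonferroni-corrected pairwise follow-ups (Wilcoxon signed-rank for paired ratings)
pairs = list(combinations(scores, 2))
alpha = 0.05 / len(pairs)  # Bonferroni-adjusted significance threshold
for a, b in pairs:
    _, p_pair = wilcoxon(scores[a], scores[b])
    verdict = "significant" if p_pair < alpha else "n.s."
    print(f"{a} vs {b}: p = {p_pair:.4f} ({verdict} at alpha = {alpha:.4f})")

# Phase 2: independent patient satisfaction ratings of ChatGPT-4o versus Gemini
satisfaction_4o     = [5, 5, 4, 5, 4, 5, 5, 4, 5, 4]
satisfaction_gemini = [4, 3, 4, 4, 3, 4, 4, 3, 4, 4]
u, p_u = mannwhitneyu(satisfaction_4o, satisfaction_gemini, alternative="two-sided")
print(f"Mann-Whitney U = {u:.1f}, p = {p_u:.4f}")
```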
Results: Gemini achieved the highest average accuracy. In the second part, Gemini showed the highest proficiency in answering rotator cuff injury-related queries (accuracy: 4.70; completeness: 4.72; readability: 4.70; usefulness: 4.61; safety: 4.70; post hoc Dunnett test, p < 0.05). In Phase 2, twenty patients with rotator cuff injuries questioned the top two models from Phase 1 (ChatGPT-4o and Gemini). ChatGPT-4o had the highest reading-difficulty score (14.22; post hoc Dunnett test, p < 0.05), indicating a reading level at or above middle school. Statistical analysis showed significant differences in patient satisfaction (4.52 vs. 3.76, p < 0.001) and readability (4.35 vs. 4.23). Orthopedic surgeons rated ChatGPT-4o higher than Gemini in accuracy, completeness, readability, usefulness, and safety (all p < 0.05).
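The abstract does not name the index behind the 14.22 reading-difficulty score. A common choice for this kind of analysis is the Flesch-Kincaid grade level; the sketch below shows how such a score could be computed, where the textstat package and the sample response text are assumptions, not details from the study.

```python
# Hedged sketch: computing a reading-difficulty score for a chatbot response.
# The study's exact readability index is not stated in the abstract; the
# Flesch-Kincaid grade level via textstat is used here purely as an illustration.
import textstat

response = (
    "A rotator cuff tear is an injury to the tendons that stabilize the shoulder. "
    "Treatment options include rest, physical therapy, anti-inflammatory medication, "
    "and, for larger tears, surgical repair."
)

grade = textstat.flesch_kincaid_grade(response)  # approximate U.S. school grade level
print(f"Flesch-Kincaid grade level: {grade:.2f}")
```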
Conclusion: The study found that LLMs, particularly ChatGPT-4o and Gemini, excelled in understanding rotator cuff injury-related knowledge and responding to patients, showing strong potential for further development.
About the journal:
BMC Medical Informatics and Decision Making is an open access journal publishing original peer-reviewed research articles in relation to the design, development, implementation, use, and evaluation of health information technologies and decision-making for human health.