{"title":"大型语言模型驱动神经外科聊天机器人的开发和验证:加强围手术期患者教育的混合方法研究。","authors":"Chung Man Ho, Shaowei Guan, Prudence Kwan-Lam Mok, Candice Hw Lam, Wai Ying Ho, Calvin Hoi-Kwan Mak, Harry Qin, Arkers Kwan Ching Wong, Vivian Hui","doi":"10.2196/74299","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>Perioperative education is crucial for optimizing outcomes in neuroendovascular procedures, where inadequate understanding can heighten patient anxiety and hinder care plan adherence. Current education models, reliant on traditional consultations and printed materials, often lack scalability and personalization. Artificial intelligence (AI)-powered chatbots have demonstrated efficacy in various health care contexts; however, their role in neuroendovascular perioperative support remains underexplored. Given the complexity of neuroendovascular procedures and the need for continuous, tailored patient education, AI chatbots have the potential to offer tailored perioperative guidance to improve patient education in this specialty.</p><p><strong>Objective: </strong>We aimed to develop, validate, and assess NeuroBot, an AI-driven system that uses large language models (LLMs) with retrieval-augmented generation to deliver timely, accurate, and evidence-based responses to patient inquiries in neurosurgery, ultimately improving the effectiveness of patient education.</p><p><strong>Methods: </strong>A mixed methods approach was used, consisting of 3 phases. In the first phase, internal validation, we compared the performance of Assistants API, ChatGPT, and Qwen by evaluating their responses to 306 bilingual neuroendovascular-related questions. The accuracy, relevance, and completeness of the responses were evaluated using a Likert scale; statistical analyses included ANOVA and paired t tests. In the second phase, external validation, 10 neurosurgical experts rated the responses generated by NeuroBot using the same evaluation metrics applied in the internal validation phase. The consistency of their ratings was measured using the intraclass correlation coefficient. Finally, in the third phase, a qualitative study was conducted through interviews with 18 health care providers, which helped identify key themes related to the NeuroBot's usability and perceived benefits. Thematic analysis was performed using NVivo and interrater reliability was confirmed through Cohen κ.</p><p><strong>Results: </strong>The Assistants API outperformed both ChatGPT and Qwen, achieving a mean accuracy score of 5.28 out of 6 (95% CI 5.21-5.35), with a statistically significant result (P<.001). External expert ratings for NeuroBot demonstrated significant improvements, with scores of 5.70 out of 6 (95% CI 5.46-5.94) for accuracy, 5.58 out of 6 (95% CI 5.45-5.94) for relevance, and 2.70 out of 3 (95% CI 2.73-2.97) for completeness. Qualitative insights highlighted NeuroBot's potential to reduce staff workload, enhance patient education, and deliver evidence-based responses.</p><p><strong>Conclusions: </strong>NeuroBot, leveraging LLMs with the retrieval-augmented generation technique, demonstrates the potential of LLM-based chatbots in perioperative neuroendovascular care, offering scalable and continuous support. By integrating domain-specific knowledge, NeuroBot simplifies communication between professionals and patients while ensuring patients have 24-7 access to reliable, evidence-based information. 
Further refinement and research will enhance NeuroBot's ability to foster patient-centered communication, optimize clinical outcomes, and advance AI-driven innovations in health care delivery.</p>","PeriodicalId":16337,"journal":{"name":"Journal of Medical Internet Research","volume":"27 ","pages":"e74299"},"PeriodicalIF":5.8000,"publicationDate":"2025-07-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Development and Validation of a Large Language Model-Powered Chatbot for Neurosurgery: Mixed Methods Study on Enhancing Perioperative Patient Education.\",\"authors\":\"Chung Man Ho, Shaowei Guan, Prudence Kwan-Lam Mok, Candice Hw Lam, Wai Ying Ho, Calvin Hoi-Kwan Mak, Harry Qin, Arkers Kwan Ching Wong, Vivian Hui\",\"doi\":\"10.2196/74299\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><strong>Background: </strong>Perioperative education is crucial for optimizing outcomes in neuroendovascular procedures, where inadequate understanding can heighten patient anxiety and hinder care plan adherence. Current education models, reliant on traditional consultations and printed materials, often lack scalability and personalization. Artificial intelligence (AI)-powered chatbots have demonstrated efficacy in various health care contexts; however, their role in neuroendovascular perioperative support remains underexplored. Given the complexity of neuroendovascular procedures and the need for continuous, tailored patient education, AI chatbots have the potential to offer tailored perioperative guidance to improve patient education in this specialty.</p><p><strong>Objective: </strong>We aimed to develop, validate, and assess NeuroBot, an AI-driven system that uses large language models (LLMs) with retrieval-augmented generation to deliver timely, accurate, and evidence-based responses to patient inquiries in neurosurgery, ultimately improving the effectiveness of patient education.</p><p><strong>Methods: </strong>A mixed methods approach was used, consisting of 3 phases. In the first phase, internal validation, we compared the performance of Assistants API, ChatGPT, and Qwen by evaluating their responses to 306 bilingual neuroendovascular-related questions. The accuracy, relevance, and completeness of the responses were evaluated using a Likert scale; statistical analyses included ANOVA and paired t tests. In the second phase, external validation, 10 neurosurgical experts rated the responses generated by NeuroBot using the same evaluation metrics applied in the internal validation phase. The consistency of their ratings was measured using the intraclass correlation coefficient. Finally, in the third phase, a qualitative study was conducted through interviews with 18 health care providers, which helped identify key themes related to the NeuroBot's usability and perceived benefits. Thematic analysis was performed using NVivo and interrater reliability was confirmed through Cohen κ.</p><p><strong>Results: </strong>The Assistants API outperformed both ChatGPT and Qwen, achieving a mean accuracy score of 5.28 out of 6 (95% CI 5.21-5.35), with a statistically significant result (P<.001). External expert ratings for NeuroBot demonstrated significant improvements, with scores of 5.70 out of 6 (95% CI 5.46-5.94) for accuracy, 5.58 out of 6 (95% CI 5.45-5.94) for relevance, and 2.70 out of 3 (95% CI 2.73-2.97) for completeness. 
Qualitative insights highlighted NeuroBot's potential to reduce staff workload, enhance patient education, and deliver evidence-based responses.</p><p><strong>Conclusions: </strong>NeuroBot, leveraging LLMs with the retrieval-augmented generation technique, demonstrates the potential of LLM-based chatbots in perioperative neuroendovascular care, offering scalable and continuous support. By integrating domain-specific knowledge, NeuroBot simplifies communication between professionals and patients while ensuring patients have 24-7 access to reliable, evidence-based information. Further refinement and research will enhance NeuroBot's ability to foster patient-centered communication, optimize clinical outcomes, and advance AI-driven innovations in health care delivery.</p>\",\"PeriodicalId\":16337,\"journal\":{\"name\":\"Journal of Medical Internet Research\",\"volume\":\"27 \",\"pages\":\"e74299\"},\"PeriodicalIF\":5.8000,\"publicationDate\":\"2025-07-15\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Journal of Medical Internet Research\",\"FirstCategoryId\":\"3\",\"ListUrlMain\":\"https://doi.org/10.2196/74299\",\"RegionNum\":2,\"RegionCategory\":\"医学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"HEALTH CARE SCIENCES & SERVICES\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Medical Internet Research","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.2196/74299","RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"HEALTH CARE SCIENCES & SERVICES","Score":null,"Total":0}
Citations: 0
Abstract
Background: Perioperative education is crucial for optimizing outcomes in neuroendovascular procedures, where inadequate understanding can heighten patient anxiety and hinder care plan adherence. Current education models, reliant on traditional consultations and printed materials, often lack scalability and personalization. Artificial intelligence (AI)-powered chatbots have demonstrated efficacy in various health care contexts; however, their role in neuroendovascular perioperative support remains underexplored. Given the complexity of neuroendovascular procedures and the need for continuous, tailored patient education, AI chatbots have the potential to offer tailored perioperative guidance to improve patient education in this specialty.
Objective: We aimed to develop, validate, and assess NeuroBot, an AI-driven system that uses large language models (LLMs) with retrieval-augmented generation to deliver timely, accurate, and evidence-based responses to patient inquiries in neurosurgery, ultimately improving the effectiveness of patient education.
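The abstract does not describe NeuroBot's implementation in detail. Purely as an illustration of the retrieval-augmented generation pattern named in the objective, a minimal Python sketch is shown below; the models, the example knowledge-base snippets, and the prompt wording are hypothetical assumptions, not details taken from the study, which built on the Assistants API.

# Minimal retrieval-augmented generation loop (illustrative sketch only; NOT the
# actual NeuroBot implementation). Assumptions: OpenAI Python SDK >= 1.0, an API key
# in OPENAI_API_KEY, and a tiny in-memory knowledge base standing in for curated content.
import numpy as np
from openai import OpenAI

client = OpenAI()

# Hypothetical perioperative education snippets (placeholders, not study materials).
documents = [
    "Before a neuroendovascular procedure, patients are usually asked to fast from midnight.",
    "After femoral artery access, patients typically lie flat for several hours to protect the puncture site.",
]

def embed(texts):
    # Return one embedding vector per input text.
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

doc_vectors = embed(documents)

def answer(question, k=1):
    # Retrieve the k most similar snippets, then ask the model to answer from them only.
    q_vec = embed([question])[0]
    sims = doc_vectors @ q_vec / (np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q_vec))
    context = "\n".join(documents[i] for i in np.argsort(sims)[::-1][:k])
    chat = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Answer patient questions using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return chat.choices[0].message.content

print(answer("How long do I need to lie flat after the procedure?"))

Grounding generation in retrieved, curated passages is what allows a chatbot of this kind to keep its answers evidence based rather than relying solely on the model's general training data.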
Methods: A mixed methods approach was used, consisting of 3 phases. In the first phase, internal validation, we compared the performance of the Assistants API, ChatGPT, and Qwen by evaluating their responses to 306 bilingual neuroendovascular-related questions. The accuracy, relevance, and completeness of the responses were evaluated using a Likert scale; statistical analyses included ANOVA and paired t tests. In the second phase, external validation, 10 neurosurgical experts rated the responses generated by NeuroBot using the same evaluation metrics applied in the internal validation phase. The consistency of their ratings was measured using the intraclass correlation coefficient. Finally, in the third phase, a qualitative study was conducted through interviews with 18 health care providers, which helped identify key themes related to NeuroBot's usability and perceived benefits. Thematic analysis was performed using NVivo, and interrater reliability was confirmed through Cohen κ.
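The paper does not publish its analysis code. The sketch below shows one way the reported statistics (one-way ANOVA, paired t tests, the intraclass correlation coefficient, and Cohen κ) could be computed in Python; all rating values are random placeholders, not data from the study.

# Illustrative statistical workflow mirroring the reported analyses.
import numpy as np
import pandas as pd
import pingouin as pg
from scipy.stats import f_oneway, ttest_rel
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(0)
n_questions = 306

# Hypothetical 1-6 accuracy ratings per question for each model (placeholders).
assistants_api = rng.integers(4, 7, n_questions)
chatgpt = rng.integers(3, 7, n_questions)
qwen = rng.integers(3, 7, n_questions)

# One-way ANOVA across the three models, then a paired comparison of two of them.
print(f_oneway(assistants_api, chatgpt, qwen))
print(ttest_rel(assistants_api, chatgpt))

# Intraclass correlation coefficient for agreement among expert raters (long format).
long = pd.DataFrame({
    "item": np.tile(np.arange(n_questions), 3),
    "rater": np.repeat(["r1", "r2", "r3"], n_questions),
    "rating": rng.integers(4, 7, n_questions * 3),
})
print(pg.intraclass_corr(data=long, targets="item", raters="rater", ratings="rating"))

# Cohen kappa for agreement between two qualitative coders assigning theme codes.
coder1 = rng.integers(0, 3, 50)
coder2 = coder1.copy()
coder2[:5] = rng.integers(0, 3, 5)
print(cohen_kappa_score(coder1, coder2))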
Results: The Assistants API outperformed both ChatGPT and Qwen, achieving a mean accuracy score of 5.28 out of 6 (95% CI 5.21-5.35), with a statistically significant result (P<.001). External expert ratings for NeuroBot demonstrated significant improvements, with scores of 5.70 out of 6 (95% CI 5.46-5.94) for accuracy, 5.58 out of 6 (95% CI 5.45-5.94) for relevance, and 2.70 out of 3 (95% CI 2.73-2.97) for completeness. Qualitative insights highlighted NeuroBot's potential to reduce staff workload, enhance patient education, and deliver evidence-based responses.
Conclusions: NeuroBot, leveraging LLMs with the retrieval-augmented generation technique, demonstrates the potential of LLM-based chatbots in perioperative neuroendovascular care, offering scalable and continuous support. By integrating domain-specific knowledge, NeuroBot simplifies communication between professionals and patients while ensuring patients have 24-7 access to reliable, evidence-based information. Further refinement and research will enhance NeuroBot's ability to foster patient-centered communication, optimize clinical outcomes, and advance AI-driven innovations in health care delivery.
About the journal:
The Journal of Medical Internet Research (JMIR) is a highly respected publication in the field of health informatics and health services. Founded in 1999, JMIR has been a pioneer in the field for over two decades.
As a leader in the industry, the journal focuses on digital health, data science, health informatics, and emerging technologies for health, medicine, and biomedical research. It is recognized as a top publication in these disciplines, ranking in the first quartile (Q1) by Impact Factor.
Notably, JMIR holds the prestigious position of being ranked #1 on Google Scholar within the "Medical Informatics" discipline.