Charting the evolution of artificial intelligence mental health chatbots from rule-based systems to large language models: a systematic review.

IF 65.8 · Region 1 (Medicine) · Q1 Medicine
World Psychiatry · Pub Date: 2025-10-01 · DOI: 10.1002/wps.21352
Yining Hua, Steve Siddals, Zilin Ma, Isaac Galatzer-Levy, Winna Xia, Christine Hau, Hongbin Na, Matthew Flathers, Jake Linardon, Cyrus Ayubcha, John Torous
{"title":"人工智能心理健康聊天机器人从基于规则的系统到大型语言模型的演变:系统回顾。","authors":"Yining Hua,Steve Siddals,Zilin Ma,Isaac Galatzer-Levy,Winna Xia,Christine Hau,Hongbin Na,Matthew Flathers,Jake Linardon,Cyrus Ayubcha,John Torous","doi":"10.1002/wps.21352","DOIUrl":null,"url":null,"abstract":"The rapid evolution of artificial intelligence (AI) chatbots in mental health care presents a fragmented landscape with variable clinical evidence and evaluation rigor. This systematic review of 160 studies (2020-2024) classifies chatbot architectures - rule-based, machine learning-based, and large language model (LLM)-based - and proposes a three-tier evaluation framework: foundational bench testing (technical validation), pilot feasibility testing (user engagement), and clinical efficacy testing (symptom reduction). While rule-based systems dominated until 2023, LLM-based chatbots surged to 45% of new studies in 2024. However, only 16% of LLM studies underwent clinical efficacy testing, with most (77%) still in early validation. Overall, only 47% of studies focused on clinical efficacy testing, exposing a critical gap in robust validation of therapeutic benefit. Discrepancies emerged between marketed claims (\"AI-powered\") and actual AI architectures, with many interventions relying on simple rule-based scripts. LLM-based chatbots are increasingly studied for emotional support and psychoeducation, yet they pose unique ethical concerns, including incorrect responses, privacy risks, and unverified therapeutic effects. Despite their generative capabilities, LLMs remain largely untested in high-stakes mental health contexts. This paper emphasizes the need for standardized evaluation and benchmarking aligned with medical AI certification to ensure safe, transparent and ethical deployment. The proposed framework enables clearer distinctions between technical novelty and clinical efficacy, offering clinicians, researchers and regulators ordered steps to guide future standards and benchmarks. To ensure that AI chatbots enhance mental health care, future research must prioritize rigorous clinical efficacy trials, transparent architecture reporting, and evaluations that reflect real-world impact rather than the well-known potential.","PeriodicalId":23858,"journal":{"name":"World Psychiatry","volume":"124 1","pages":"383-394"},"PeriodicalIF":65.8000,"publicationDate":"2025-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Charting the evolution of artificial intelligence mental health chatbots from rule-based systems to large language models: a systematic review.\",\"authors\":\"Yining Hua,Steve Siddals,Zilin Ma,Isaac Galatzer-Levy,Winna Xia,Christine Hau,Hongbin Na,Matthew Flathers,Jake Linardon,Cyrus Ayubcha,John Torous\",\"doi\":\"10.1002/wps.21352\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The rapid evolution of artificial intelligence (AI) chatbots in mental health care presents a fragmented landscape with variable clinical evidence and evaluation rigor. This systematic review of 160 studies (2020-2024) classifies chatbot architectures - rule-based, machine learning-based, and large language model (LLM)-based - and proposes a three-tier evaluation framework: foundational bench testing (technical validation), pilot feasibility testing (user engagement), and clinical efficacy testing (symptom reduction). While rule-based systems dominated until 2023, LLM-based chatbots surged to 45% of new studies in 2024. 
However, only 16% of LLM studies underwent clinical efficacy testing, with most (77%) still in early validation. Overall, only 47% of studies focused on clinical efficacy testing, exposing a critical gap in robust validation of therapeutic benefit. Discrepancies emerged between marketed claims (\\\"AI-powered\\\") and actual AI architectures, with many interventions relying on simple rule-based scripts. LLM-based chatbots are increasingly studied for emotional support and psychoeducation, yet they pose unique ethical concerns, including incorrect responses, privacy risks, and unverified therapeutic effects. Despite their generative capabilities, LLMs remain largely untested in high-stakes mental health contexts. This paper emphasizes the need for standardized evaluation and benchmarking aligned with medical AI certification to ensure safe, transparent and ethical deployment. The proposed framework enables clearer distinctions between technical novelty and clinical efficacy, offering clinicians, researchers and regulators ordered steps to guide future standards and benchmarks. To ensure that AI chatbots enhance mental health care, future research must prioritize rigorous clinical efficacy trials, transparent architecture reporting, and evaluations that reflect real-world impact rather than the well-known potential.\",\"PeriodicalId\":23858,\"journal\":{\"name\":\"World Psychiatry\",\"volume\":\"124 1\",\"pages\":\"383-394\"},\"PeriodicalIF\":65.8000,\"publicationDate\":\"2025-10-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"World Psychiatry\",\"FirstCategoryId\":\"3\",\"ListUrlMain\":\"https://doi.org/10.1002/wps.21352\",\"RegionNum\":1,\"RegionCategory\":\"医学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"Medicine\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"World Psychiatry","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1002/wps.21352","RegionNum":1,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"Medicine","Score":null,"Total":0}
Citations: 0

Abstract

The rapid evolution of artificial intelligence (AI) chatbots in mental health care presents a fragmented landscape with variable clinical evidence and evaluation rigor. This systematic review of 160 studies (2020-2024) classifies chatbot architectures - rule-based, machine learning-based, and large language model (LLM)-based - and proposes a three-tier evaluation framework: foundational bench testing (technical validation), pilot feasibility testing (user engagement), and clinical efficacy testing (symptom reduction). While rule-based systems dominated until 2023, LLM-based chatbots surged to 45% of new studies in 2024. However, only 16% of LLM studies underwent clinical efficacy testing, with most (77%) still in early validation. Overall, only 47% of studies focused on clinical efficacy testing, exposing a critical gap in robust validation of therapeutic benefit. Discrepancies emerged between marketed claims ("AI-powered") and actual AI architectures, with many interventions relying on simple rule-based scripts. LLM-based chatbots are increasingly studied for emotional support and psychoeducation, yet they pose unique ethical concerns, including incorrect responses, privacy risks, and unverified therapeutic effects. Despite their generative capabilities, LLMs remain largely untested in high-stakes mental health contexts. This paper emphasizes the need for standardized evaluation and benchmarking aligned with medical AI certification to ensure safe, transparent and ethical deployment. The proposed framework enables clearer distinctions between technical novelty and clinical efficacy, offering clinicians, researchers and regulators ordered steps to guide future standards and benchmarks. To ensure that AI chatbots enhance mental health care, future research must prioritize rigorous clinical efficacy trials, transparent architecture reporting, and evaluations that reflect real-world impact rather than well-publicized potential.
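
To make the review's taxonomy and evaluation ladder concrete, below is a minimal Python sketch of how the three architecture classes and the three-tier framework could be encoded for a coded study corpus. All identifiers here (Architecture, EvaluationTier, Study, clinical_efficacy_rate) are illustrative assumptions for this sketch, not names taken from the review itself.

    from dataclasses import dataclass
    from enum import Enum


    class Architecture(Enum):
        """The three chatbot architecture classes distinguished by the review."""
        RULE_BASED = "rule-based"        # scripted decision trees, fixed responses
        ML_BASED = "machine learning-based"  # classic NLP/ML pipelines
        LLM_BASED = "LLM-based"          # generative large language models


    class EvaluationTier(Enum):
        """The review's three-tier evaluation framework, in order of rigor."""
        BENCH = 1     # foundational bench testing: technical validation
        PILOT = 2     # pilot feasibility testing: user engagement
        CLINICAL = 3  # clinical efficacy testing: symptom reduction


    @dataclass
    class Study:
        chatbot: str
        architecture: Architecture
        highest_tier: EvaluationTier  # most rigorous evaluation the study reached


    def clinical_efficacy_rate(studies: list[Study]) -> float:
        """Fraction of studies whose evaluation reached clinical efficacy testing."""
        if not studies:
            return 0.0
        reached = sum(s.highest_tier is EvaluationTier.CLINICAL for s in studies)
        return reached / len(studies)

Under these assumptions, applying clinical_efficacy_rate to the LLM-based subset of a corpus coded this way would correspond to the review's 16% figure, and applying it to the full 160-study corpus to the 47% figure.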
Source journal
World Psychiatry (category: Nursing - Psychiatric Mental Health)
CiteScore: 64.10
Self-citation rate: 7.40%
Articles published: 124
Journal introduction: World Psychiatry is the official journal of the World Psychiatric Association. It aims to disseminate information on significant clinical, service, and research developments in the mental health field. The journal is published three times per year and is sent free of charge to psychiatrists, whose names and addresses are provided by WPA member societies and sections. The language used in the journal is designed to be understandable to the majority of mental health professionals worldwide.