Evaluating Generative AI in Mental Health: Systematic Review of Capabilities and Limitations.

IF 4.8 | Zone 2 (Medicine) | Q1 (Psychiatry)

JMIR Mental Health | Pub Date: 2025-05-15 | DOI: 10.2196/70014

Liying Wang, Tanmay Bhanushali, Zhuoran Huang, Jingyi Yang, Sukriti Badami, Lisa Hightow-Weidman

Volume 12, article e70014. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12097452/pdf/

Citations: 0

Abstract

Background: The global shortage of mental health professionals, exacerbated by increasing mental health needs post COVID-19, has stimulated growing interest in leveraging large language models to address these challenges.

Objectives: This systematic review aims to evaluate the current capabilities of generative artificial intelligence (GenAI) models in the context of mental health applications.

Methods: A comprehensive search across 5 databases yielded 1046 references, of which 8 studies met the inclusion criteria. The included studies were original research with experimental designs (eg, Turing tests, sociocognitive tasks, trials, or qualitative methods); a focus on GenAI models; and explicit measurement of sociocognitive abilities (eg, empathy and emotional awareness), mental health outcomes, and user experience (eg, perceived trust and empathy).

Results: The studies, published between 2023 and 2024, primarily evaluated models such as ChatGPT-3.5 and 4.0, Bard, and Claude in tasks such as psychoeducation, diagnosis, emotional awareness, and clinical interventions. Most studies used zero-shot prompting and human evaluators to assess the AI responses, using standardized rating scales or qualitative analysis. However, these methods were often insufficient to fully capture the complexity of GenAI capabilities. The reliance on single-shot prompting techniques, limited comparisons, and task-based assessments isolated from a context may oversimplify GenAI's abilities and overlook the nuances of human-artificial intelligence interaction, especially in clinical applications that require contextual reasoning and cultural sensitivity. The findings suggest that while GenAI models demonstrate strengths in psychoeducation and emotional awareness, their diagnostic accuracy, cultural competence, and ability to engage users emotionally remain limited. Users frequently reported concerns about trustworthiness, accuracy, and the lack of emotional engagement.

Conclusions: Future research could use more sophisticated evaluation methods, such as few-shot and chain-of-thought prompting to fully uncover GenAI's potential. Longitudinal studies and broader comparisons with human benchmarks are needed to explore the effects of GenAI-integrated mental health care.

Source journal

JMIR Mental Health (Medicine: Psychiatry and Mental Health)

CiteScore: 10.80
Self-citation rate: 3.80%
Articles published: 104
Review time: 16 weeks

About the journal: JMIR Mental Health (JMH, ISSN 2368-7959) is a PubMed-indexed, peer-reviewed sister journal of JMIR, the leading eHealth journal (Impact Factor 2016: 5.175). JMIR Mental Health focuses on digital health and Internet interventions, technologies and electronic innovations (software and hardware) for mental health, addictions, online counselling and behaviour change. This includes formative evaluation and system descriptions, theoretical papers, review papers, viewpoint/vision papers, and rigorous evaluations.