{"title":"心理治疗会话代理的质量评估:框架发展与横断面研究。","authors":"Kunmi Sobowale, Daniel Kevin Humphrey","doi":"10.2196/65605","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>Despite potential risks, artificial intelligence-based chatbots that simulate psychotherapy are becoming more widely available and frequently used by the general public. A comprehensive way of evaluating the quality of these chatbots is needed.</p><p><strong>Objective: </strong>To address this need, we developed the CAPE (Conversational Agent for Psychotherapy Evaluation) framework to aid clinicians, researchers, and lay users in assessing psychotherapy chatbot quality. We use the framework to evaluate and compare the quality of popular artificial intelligence psychotherapy chatbots on the OpenAI GPT store.</p><p><strong>Methods: </strong>We identified 4 popular chatbots on OpenAI's GPT store. Two reviewers independently applied the CAPE framework to these chatbots, using 2 fictional personas to simulate interactions. The modular framework has 8 sections, each yielding an independent quality subscore between 0 and 1. We used t tests and nonparametric Wilcoxon signed rank tests to examine pairwise differences in quality subscores between chatbots.</p><p><strong>Results: </strong>Chatbots consistently scored highly on the sections of background information (subscores=0.83-1), conversational capabilities (subscores=0.83-1), therapeutic alliance, and boundaries (subscores=0.75-1), and accessibility (subscores=0.8-0.95). Scores were low for the therapeutic orientation (subscores=0) and monitoring and risk evaluation sections (subscores=0.67-0.75). Information on training data and knowledge base sections was not transparent (subscores=0). Except for the privacy and harm section (mean 0.017, SD 0.00; t3=∞; P<.001), there were no differences in subscores between the chatbots.</p><p><strong>Conclusions: </strong>The CAPE framework offers a robust and reliable method for assessing the quality of psychotherapy chatbots, enabling users to make informed choices based on their specific needs and preferences. Our evaluation revealed that while the popular chatbots on OpenAI's GPT store were effective at developing rapport and were easily accessible, they failed to address essential safety and privacy functions adequately.</p>","PeriodicalId":14841,"journal":{"name":"JMIR Formative Research","volume":"9 ","pages":"e65605"},"PeriodicalIF":2.0000,"publicationDate":"2025-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12239686/pdf/","citationCount":"0","resultStr":"{\"title\":\"Evaluating the Quality of Psychotherapy Conversational Agents: Framework Development and Cross-Sectional Study.\",\"authors\":\"Kunmi Sobowale, Daniel Kevin Humphrey\",\"doi\":\"10.2196/65605\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><strong>Background: </strong>Despite potential risks, artificial intelligence-based chatbots that simulate psychotherapy are becoming more widely available and frequently used by the general public. A comprehensive way of evaluating the quality of these chatbots is needed.</p><p><strong>Objective: </strong>To address this need, we developed the CAPE (Conversational Agent for Psychotherapy Evaluation) framework to aid clinicians, researchers, and lay users in assessing psychotherapy chatbot quality. 
We use the framework to evaluate and compare the quality of popular artificial intelligence psychotherapy chatbots on the OpenAI GPT store.</p><p><strong>Methods: </strong>We identified 4 popular chatbots on OpenAI's GPT store. Two reviewers independently applied the CAPE framework to these chatbots, using 2 fictional personas to simulate interactions. The modular framework has 8 sections, each yielding an independent quality subscore between 0 and 1. We used t tests and nonparametric Wilcoxon signed rank tests to examine pairwise differences in quality subscores between chatbots.</p><p><strong>Results: </strong>Chatbots consistently scored highly on the sections of background information (subscores=0.83-1), conversational capabilities (subscores=0.83-1), therapeutic alliance, and boundaries (subscores=0.75-1), and accessibility (subscores=0.8-0.95). Scores were low for the therapeutic orientation (subscores=0) and monitoring and risk evaluation sections (subscores=0.67-0.75). Information on training data and knowledge base sections was not transparent (subscores=0). Except for the privacy and harm section (mean 0.017, SD 0.00; t3=∞; P<.001), there were no differences in subscores between the chatbots.</p><p><strong>Conclusions: </strong>The CAPE framework offers a robust and reliable method for assessing the quality of psychotherapy chatbots, enabling users to make informed choices based on their specific needs and preferences. Our evaluation revealed that while the popular chatbots on OpenAI's GPT store were effective at developing rapport and were easily accessible, they failed to address essential safety and privacy functions adequately.</p>\",\"PeriodicalId\":14841,\"journal\":{\"name\":\"JMIR Formative Research\",\"volume\":\"9 \",\"pages\":\"e65605\"},\"PeriodicalIF\":2.0000,\"publicationDate\":\"2025-07-02\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12239686/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"JMIR Formative Research\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.2196/65605\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"HEALTH CARE SCIENCES & SERVICES\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"JMIR Formative Research","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.2196/65605","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"HEALTH CARE SCIENCES & SERVICES","Score":null,"Total":0}
Evaluating the Quality of Psychotherapy Conversational Agents: Framework Development and Cross-Sectional Study.
Background: Despite potential risks, artificial intelligence-based chatbots that simulate psychotherapy are becoming more widely available and frequently used by the general public. A comprehensive way of evaluating the quality of these chatbots is needed.
Objective: To address this need, we developed the CAPE (Conversational Agent for Psychotherapy Evaluation) framework to aid clinicians, researchers, and lay users in assessing psychotherapy chatbot quality. We used the framework to evaluate and compare the quality of popular artificial intelligence psychotherapy chatbots on OpenAI's GPT store.
Methods: We identified 4 popular chatbots on OpenAI's GPT store. Two reviewers independently applied the CAPE framework to these chatbots, using 2 fictional personas to simulate interactions. The modular framework has 8 sections, each yielding an independent quality subscore between 0 and 1. We used t tests and nonparametric Wilcoxon signed-rank tests to examine pairwise differences in quality subscores between chatbots.
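As a minimal sketch of this analysis (the paper does not publish code; the subscore values and chatbot names below are hypothetical placeholders, not the study's data), the pairwise comparisons could be run with scipy as follows:

```python
# Minimal sketch of the pairwise comparison described in the Methods.
# Subscore values are illustrative placeholders, not the study's data.
from itertools import combinations

import numpy as np
from scipy import stats

# One quality subscore (0-1) per CAPE section, per chatbot (hypothetical values).
subscores = {
    "chatbot_a": np.array([0.90, 0.85, 0.80, 0.95, 0.00, 0.70, 0.00, 0.50]),
    "chatbot_b": np.array([1.00, 0.83, 0.75, 0.80, 0.00, 0.67, 0.00, 0.55]),
}

# For each pair of chatbots: a paired t test on the 8 section subscores,
# plus its nonparametric counterpart, the Wilcoxon signed-rank test.
for (name_a, a), (name_b, b) in combinations(subscores.items(), 2):
    t_stat, t_p = stats.ttest_rel(a, b)
    w_stat, w_p = stats.wilcoxon(a, b)
    print(f"{name_a} vs {name_b}: t={t_stat:.2f} (P={t_p:.3f}), "
          f"Wilcoxon W={w_stat:.1f} (P={w_p:.3f})")
```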
Results: Chatbots consistently scored highly on the background information (subscores=0.83-1), conversational capabilities (subscores=0.83-1), therapeutic alliance and boundaries (subscores=0.75-1), and accessibility (subscores=0.8-0.95) sections. Scores were low for the therapeutic orientation (subscores=0) and monitoring and risk evaluation (subscores=0.67-0.75) sections. Information for the training data and knowledge base sections was not transparent (subscores=0). Except for the privacy and harm section (mean 0.017, SD 0.00; t3=∞; P<.001), there were no differences in subscores between the chatbots.
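The infinite t statistic for the privacy and harm section follows directly from the paired t test formula (assuming, as the reported df of 3 suggests, n=4 paired observations, and that the reported mean and SD describe the paired differences):

$$ t = \frac{\bar{d}}{s_d/\sqrt{n}}, \qquad \bar{d}=0.017,\; s_d=0.00,\; n=4 \;\Rightarrow\; t_{3}=\frac{0.017}{0/\sqrt{4}} \to \infty $$

With zero variability in the differences, any nonzero mean difference drives the statistic to infinity, which is why this section alone shows a significant pairwise difference.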
Conclusions: The CAPE framework offers a robust and reliable method for assessing the quality of psychotherapy chatbots, enabling users to make informed choices based on their specific needs and preferences. Our evaluation revealed that while the popular chatbots on OpenAI's GPT store were effective at developing rapport and were easily accessible, they failed to address essential safety and privacy functions adequately.