Integrative modeling enables ChatGPT to achieve average level of human counselors performance in mental health Q&A

IF 7.4 · Zone 1 (Management) · Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS

Information Processing & Management Pub Date : 2025-04-03 DOI:10.1016/j.ipm.2025.104152

Yinghui Huang , Weijun Wang , Jinyi Zhou , Liang Zhang , Jionghao Lin , Hui Liu , Xiangen Hu , Zongkui Zhou , Wanghao Dong

Information Processing & Management, vol. 62, no. 5, Article 104152. https://www.sciencedirect.com/science/article/pii/S0306457325000937

Citations: 0

Abstract

Recent advancements in generative artificial intelligence (GenAI), particularly ChatGPT, have demonstrated significant potential in addressing the persistent treatment gap in mental health care. Systematic evaluation of ChatGPT’s capabilities in addressing mental health questions is essential for its large-scale application. The current study introduces a computational evaluation framework centered on perceived information quality (PIQ) to quantitatively assess ChatGPT’s capabilities. Leveraging datasets of question-answer pairs generated by both humans and ChatGPT, the framework integrates predictive modeling, explainable modeling, and prompt-engineering-based validation to identify intrinsic evaluation metrics and enable automated assessments. Results revealed that the PIQ of unprompted ChatGPT is significantly lower than that of human counselors overall, with notable deficiencies such as insufficient conversational length, lower text diversity, and reduced professionalism. Although ChatGPT did not match the top 25% of human counselors, our evaluation framework improved its mean PIQ by 8.91% to 11.67% across four risk levels. Prompted ChatGPT performed comparably to human counselors on severe-risk (p = 0.0561) and moderate-risk (p = 0.7851) questions, and significantly outperformed them in the low- and no-risk categories by 6.80% and 4.63%, respectively (p < 0.001). However, undesirable verbal behaviors persist in text diversity and professionalism. These findings validate ChatGPT’s capability to address mental health questions while cautioning that further research is necessary before LLM-based mental health systems can deliver services comparable to those of human experts.
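The percentage gains and p-values reported above can be read as a relative change in group means plus a two-sample significance test. The following is a minimal sketch of that arithmetic only. It is not the authors' pipeline: the abstract does not say which statistical test was used, so a permutation test on the difference in means stands in here as an assumption, and the function names and scores are hypothetical, made-up values for illustration.

```python
import random
from statistics import mean


def relative_improvement(baseline, treated):
    """Percent change in the mean score from baseline to treated."""
    base = mean(baseline)
    return (mean(treated) - base) / base * 100.0


def permutation_p_value(a, b, n_resamples=10_000, seed=0):
    """Two-sided permutation test on the difference in group means.

    Assumption: this test is a stand-in; the abstract does not name
    the test the authors actually used.
    """
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    n_a = len(a)
    hits = 0
    for _ in range(n_resamples):
        # Reassign group labels at random and recompute the gap.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_resamples + 1)


if __name__ == "__main__":
    # Hypothetical PIQ scores, invented purely for illustration.
    unprompted = [3.0, 3.2, 2.9, 3.1, 3.3]
    prompted = [3.3, 3.5, 3.2, 3.4, 3.6]
    print(f"relative improvement: {relative_improvement(unprompted, prompted):.2f}%")
    print(f"permutation p-value: {permutation_p_value(unprompted, prompted):.4f}")
```

Under this reading, a large p-value (as reported for the severe- and moderate-risk questions) means the observed gap between groups is one that random relabeling produces often, which is why those results are described as comparable and not as a difference in either direction.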


Source journal

Information Processing & Management (Engineering & Technology: Computer Science, Information Systems)

CiteScore

17.00

Self-citation rate

11.60%

Articles published

276

Review time

39 days

Journal description: Information Processing and Management is dedicated to publishing cutting-edge original research at the convergence of computing and information science. Our scope encompasses theory, methods, and applications across various domains, including advertising, business, health, information science, information technology marketing, and social computing. We aim to cater to the interests of both primary researchers and practitioners by offering an effective platform for the timely dissemination of advanced and topical issues in this interdisciplinary field. The journal places particular emphasis on original research articles, research survey articles, research method articles, and articles addressing critical applications of research. Join us in advancing knowledge and innovation at the intersection of computing and information science.