Mental-LLM: Leveraging Large Language Models for Mental Health Prediction via Online Text Data.

IF 4.5 Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS

Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies Pub Date : 2024-03-01 Epub Date: 2024-03-06 DOI:10.1145/3643540

Xuhai Xu, Bingsheng Yao, Yuanzhe Dong, Saadia Gabriel, Hong Yu, James Hendler, Marzyeh Ghassemi, Anind K Dey, Dakuo Wang

{"title":"Mental-LLM: Leveraging Large Language Models for Mental Health Prediction via Online Text Data.","authors":"Xuhai Xu, Bingsheng Yao, Yuanzhe Dong, Saadia Gabriel, Hong Yu, James Hendler, Marzyeh Ghassemi, Anind K Dey, Dakuo Wang","doi":"10.1145/3643540","DOIUrl":null,"url":null,"abstract":"<p><p>Advances in large language models (LLMs) have empowered a variety of applications. However, there is still a significant gap in research when it comes to understanding and enhancing the capabilities of LLMs in the field of mental health. In this work, we present a comprehensive evaluation of multiple LLMs on various mental health prediction tasks via online text data, including Alpaca, Alpaca-LoRA, FLAN-T5, GPT-3.5, and GPT-4. We conduct a broad range of experiments, covering zero-shot prompting, few-shot prompting, and instruction fine-tuning. The results indicate a promising yet limited performance of LLMs with zero-shot and few-shot prompt designs for mental health tasks. More importantly, our experiments show that instruction finetuning can significantly boost the performance of LLMs for all tasks simultaneously. Our best-finetuned models, Mental-Alpaca and Mental-FLAN-T5, outperform the best prompt design of GPT-3.5 (25 and 15 times bigger) by 10.9% on balanced accuracy and the best of GPT-4 (250 and 150 times bigger) by 4.8%. They further perform on par with the state-of-the-art task-specific language model. We also conduct an exploratory case study on LLMs' capability on mental health reasoning tasks, illustrating the promising capability of certain models such as GPT-4. We summarize our findings into a set of action guidelines for potential methods to enhance LLMs' capability for mental health tasks. Meanwhile, we also emphasize the important limitations before achieving deployability in real-world mental health settings, such as known racial and gender bias. We highlight the important ethical risks accompanying this line of research.</p>","PeriodicalId":20553,"journal":{"name":"Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies","volume":"8 1","pages":""},"PeriodicalIF":4.5000,"publicationDate":"2024-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11806945/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3643540","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2024/3/6 0:00:00","PubModel":"Epub","JCR":"Q2","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}

引用次数: 0

Abstract

Advances in large language models (LLMs) have empowered a variety of applications. However, there is still a significant gap in research when it comes to understanding and enhancing the capabilities of LLMs in the field of mental health. In this work, we present a comprehensive evaluation of multiple LLMs on various mental health prediction tasks via online text data, including Alpaca, Alpaca-LoRA, FLAN-T5, GPT-3.5, and GPT-4. We conduct a broad range of experiments, covering zero-shot prompting, few-shot prompting, and instruction fine-tuning. The results indicate a promising yet limited performance of LLMs with zero-shot and few-shot prompt designs for mental health tasks. More importantly, our experiments show that instruction finetuning can significantly boost the performance of LLMs for all tasks simultaneously. Our best-finetuned models, Mental-Alpaca and Mental-FLAN-T5, outperform the best prompt design of GPT-3.5 (25 and 15 times bigger) by 10.9% on balanced accuracy and the best of GPT-4 (250 and 150 times bigger) by 4.8%. They further perform on par with the state-of-the-art task-specific language model. We also conduct an exploratory case study on LLMs' capability on mental health reasoning tasks, illustrating the promising capability of certain models such as GPT-4. We summarize our findings into a set of action guidelines for potential methods to enhance LLMs' capability for mental health tasks. Meanwhile, we also emphasize the important limitations before achieving deployability in real-world mental health settings, such as known racial and gender bias. We highlight the important ethical risks accompanying this line of research.

查看原文本刊更多论文

心理-法学硕士：利用大型语言模型通过在线文本数据进行心理健康预测。

大型语言模型（llm）的进步为各种应用程序提供了支持。然而，在理解和提高法学硕士在心理健康领域的能力方面，研究仍然存在很大的差距。在这项工作中，我们通过在线文本数据，包括Alpaca， Alpaca- lora, FLAN-T5, GPT-3.5和GPT-4，对多个LLMs进行了各种心理健康预测任务的综合评估。我们进行了广泛的实验，包括零弹提示、少弹提示和指令微调。结果表明，零镜头和少镜头提示设计的llm在心理健康任务中表现良好，但效果有限。更重要的是，我们的实验表明，指令微调可以显著提高llm同时处理所有任务的性能。我们的最佳微调模型Mental-Alpaca和Mental-FLAN-T5在平衡精度上比最佳提示设计的GPT-3.5（大25倍和15倍）高出10.9%，比最佳提示设计的GPT-4（大250倍和150倍）高出4.8%。它们的性能与最先进的任务特定语言模型相当。我们还对法学硕士在心理健康推理任务上的能力进行了探索性案例研究，说明了某些模型（如GPT-4）的前景。我们将我们的研究结果总结为一套行动指南，用于提高法学硕士心理健康任务能力的潜在方法。同时，我们还强调了在现实世界的心理健康环境中实现可部署性之前的重要限制，例如已知的种族和性别偏见。我们强调了伴随这一研究的重要伦理风险。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies Computer Science-Computer Networks and Communications

CiteScore

9.10

自引率

0.00%

发文量

154