Leveraging large language models for automated depression screening.

Journal: PLOS Digital Health (Impact Factor 7.7)
Publication date: 2025-07-28 (eCollection date: 2025-07-01)
Volume 4(7), e0000943
DOI: 10.1371/journal.pdig.0000943
Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12303271/pdf/
Authors: Bazen Gashaw Teferra, Argyrios Perivolaris, Wei-Ni Hsiang, Christian Kevin Sidharta, Alice Rueda, Karisa Parkington, Yuqi Wu, Achint Soni, Reza Samavi, Rakesh Jetly, Yanbo Zhang, Bo Cao, Sirisha Rambhatla, Sri Krishnan, Venkat Bhat

Abstract

Mental health diagnoses present unique challenges that complicate the management of an individual's well-being and daily functioning. Self-report questionnaires are commonly used in clinical settings to help mitigate the challenges of mental health disorder screening. However, these questionnaires rely on an individual's subjective responses, which can be influenced by various factors. Despite advances in Large Language Models (LLMs), quantifying self-reported experiences with natural language processing has achieved only imperfect accuracy. This project aims to demonstrate the effectiveness of zero-shot LLMs for screening depression and assessing its item scales. The DAIC-WOZ is a publicly available mental health dataset that contains textual data from clinical interviews and self-report questionnaires with relevant mental health disorder labels. The RISEN prompt engineering framework was used to evaluate the LLMs' effectiveness in predicting depression symptoms for individual PHQ-8 items. Several LLMs, including GPT models, Llama3_8B, Cohere, and Gemini, were assessed on their performance. The GPT models, especially GPT-4o, consistently outperformed the other LLMs (Llama3_8B, Cohere, Gemini) across all eight PHQ-8 items in accuracy (M = 75.9%) and F1 score (0.74). GPT models were able to predict PHQ-8 items related to emotional and cognitive states, Llama3_8B demonstrated superior detection of anhedonia-related symptoms, and Cohere's strength was identifying and predicting psychomotor activity symptoms. This study provides a novel outlook on the potential of LLMs for predicting self-reported questionnaire scores from textual interview data. The promising preliminary performance of the various models indicates that these models could effectively assist in depression screening. Further research is needed to establish a framework for determining which LLM is best suited to specific mental health symptoms and other disorders, and analysis of additional datasets alongside model fine-tuning should be explored.
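
To make the described pipeline concrete, below is a minimal sketch of zero-shot PHQ-8 item scoring from interview text. The RISEN field wording (Role, Instructions, Steps, End goal, Narrowing), the helper names (build_risen_prompt, score_item, evaluate), the use of the OpenAI chat API with a "gpt-4o" model string, and the 0-3 rating parse are illustrative assumptions, not the authors' exact prompts or evaluation code.

```python
# Minimal sketch: zero-shot PHQ-8 item scoring with a RISEN-style prompt.
# Assumptions: OpenAI chat API with "gpt-4o", OPENAI_API_KEY set in the
# environment, and macro-averaged F1 as the reported F1 metric.
from openai import OpenAI
from sklearn.metrics import accuracy_score, f1_score

PHQ8_ITEMS = {
    1: "little interest or pleasure in doing things (anhedonia)",
    2: "feeling down, depressed, or hopeless",
    3: "trouble falling or staying asleep, or sleeping too much",
    4: "feeling tired or having little energy",
    5: "poor appetite or overeating",
    6: "feeling bad about yourself or feeling like a failure",
    7: "trouble concentrating on things",
    8: "moving or speaking slowly, or being fidgety or restless (psychomotor)",
}

client = OpenAI()  # assumes OPENAI_API_KEY is available


def build_risen_prompt(transcript: str, item_id: int) -> str:
    """Compose a RISEN-style prompt (Role, Instructions, Steps, End goal,
    Narrowing) asking the model to rate one PHQ-8 item on a 0-3 scale."""
    return (
        "Role: You are a clinical assistant screening for depression.\n"
        "Instructions: Read the interview transcript and rate the PHQ-8 item "
        f"'{PHQ8_ITEMS[item_id]}' from 0 to 3 "
        "(0 = not at all, 1 = several days, 2 = more than half the days, "
        "3 = nearly every day).\n"
        "Steps: 1) Identify utterances relevant to this symptom. "
        "2) Judge how often the symptom occurred over the past two weeks. "
        "3) Map that frequency to the 0-3 scale.\n"
        "End goal: Output a single integer from 0 to 3 and nothing else.\n"
        "Narrowing: Use only evidence in the transcript; if there is none, output 0.\n\n"
        f"Transcript:\n{transcript}\n"
    )


def score_item(transcript: str, item_id: int, model: str = "gpt-4o") -> int:
    """Query the LLM zero-shot (no examples, temperature 0) and parse the rating."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[{"role": "user", "content": build_risen_prompt(transcript, item_id)}],
    )
    text = response.choices[0].message.content.strip()
    digits = [c for c in text if c.isdigit()]
    return min(int(digits[0]), 3) if digits else 0


def evaluate(predictions: list[int], labels: list[int]) -> dict:
    """Per-item accuracy and macro F1, mirroring the metrics named in the abstract."""
    return {
        "accuracy": accuracy_score(labels, predictions),
        "f1": f1_score(labels, predictions, average="macro"),
    }
```

In this sketch each PHQ-8 item is scored independently per transcript and the per-item predictions are compared against the self-reported questionnaire labels; whether the study binarized item scores or kept the full 0-3 scale is not stated in the abstract, so the 0-3 parse here is an assumption.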
