Natural Language Response Formats for Assessing Depression and Worry With Large Language Models: A Sequential Evaluation With Model Pre-Registration.

IF 3.4 · CAS Tier 2 (Psychology) · JCR Q1 (PSYCHOLOGY, CLINICAL)
Zhuojun Gu, Katarina Kjell, H Andrew Schwartz, Oscar Kjell
{"title":"Natural Language Response Formats for Assessing Depression and Worry With Large Language Models: A Sequential Evaluation With Model Pre-Registration.","authors":"Zhuojun Gu, Katarina Kjell, H Andrew Schwartz, Oscar Kjell","doi":"10.1177/10731911251364022","DOIUrl":null,"url":null,"abstract":"<p><p>Large language models can transform individuals' mental health descriptions into scores that correlate with rating scales approaching theoretical upper limits. However, such analyses have combined word- and text responses with little known about their differences. We develop response formats ranging from closed-ended to open-ended: (a) select words from lists, write (b) descriptive words, (c) phrases, or (d) texts. Participants answered questions about their depression/worry using the response formats and related rating scales. Language responses were transformed into word embeddings and trained to rating scales. We compare the validity (concurrent, incremental, face, discriminant, and external validity) and reliability (prospective sample and test-retest reliability) of the response formats. Using the <i>Sequential Evaluation with Model Pre-Registration</i> design, machine-learning models were trained on a development dataset (<i>N</i> = 963), and then <i>pre-registered</i> before tested on a prospective sample (<i>N</i> = 145). The pre-registered models demonstrate strong validity and reliability, yielding high accuracy in the prospective sample (<i>r</i> <i>=</i> .60-.79). Additionally, the models demonstrated external validity to self-reported sick-leave/healthcare visits, where the text-format yielded the strongest correlations (being higher/equal to rating scales for 9 of 12 cases). The overall high validity and reliability across formats suggest the possibility of choosing formats according to clinical needs.</p>","PeriodicalId":8577,"journal":{"name":"Assessment","volume":" ","pages":"10731911251364022"},"PeriodicalIF":3.4000,"publicationDate":"2025-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Assessment","FirstCategoryId":"102","ListUrlMain":"https://doi.org/10.1177/10731911251364022","RegionNum":2,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"PSYCHOLOGY, CLINICAL","Score":null,"Total":0}
Citations: 0

Abstract

Large language models can transform individuals' mental health descriptions into scores that correlate with rating scales at levels approaching theoretical upper limits. However, such analyses have combined word and text responses, and little is known about their differences. We develop response formats ranging from closed-ended to open-ended: (a) selecting words from lists, or writing (b) descriptive words, (c) phrases, or (d) texts. Participants answered questions about their depression/worry using these response formats and related rating scales. Language responses were transformed into word embeddings and trained to predict rating-scale scores. We compare the validity (concurrent, incremental, face, discriminant, and external validity) and reliability (prospective-sample and test-retest reliability) of the response formats. Using the Sequential Evaluation with Model Pre-Registration design, machine-learning models were trained on a development dataset (N = 963) and then pre-registered before being tested on a prospective sample (N = 145). The pre-registered models demonstrate strong validity and reliability, yielding high accuracy in the prospective sample (r = .60-.79). Additionally, the models demonstrated external validity against self-reported sick leave/healthcare visits, where the text format yielded the strongest correlations (higher than or equal to the rating scales in 9 of 12 cases). The overall high validity and reliability across formats suggest that formats can be chosen according to clinical needs.
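
To make the pipeline concrete, below is a minimal sketch, not the authors' code, of the steps the abstract describes: embed free-text responses, train a regression model on the development sample, freeze ("pre-register") it, and then evaluate it once on the prospective sample with a Pearson correlation. The embedding model, the ridge regressor, and all data are illustrative assumptions; the paper's own models, samples, and rating scales differ.

    # Sketch of the abstract's pipeline (assumed components, placeholder data).
    import numpy as np
    from scipy.stats import pearsonr
    from sentence_transformers import SentenceTransformer
    from sklearn.linear_model import RidgeCV

    encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

    # Development sample: open-ended responses with rating-scale totals
    # (placeholders standing in for the paper's N = 963).
    dev_texts = [
        "I feel hopeless and exhausted most days.",
        "Mostly fine, some stress before deadlines.",
        "I worry constantly and cannot sleep.",
        "Calm and content this past week.",
    ]
    dev_scores = np.array([18.0, 5.0, 15.0, 2.0])  # placeholder totals

    # 1. Train on the development dataset.
    model = RidgeCV(alphas=np.logspace(-3, 3, 13))
    model.fit(encoder.encode(dev_texts), dev_scores)

    # 2. Pre-register: freeze the trained model before any new data arrive
    #    (in practice, save and publicly deposit the fitted weights).

    # 3. Evaluate once on the prospective sample (stand-in for N = 145).
    new_texts = [
        "Everything feels heavy and pointless.",
        "Doing well, occasional mild worry.",
        "On edge all day about small things.",
        "Relaxed and sleeping well.",
    ]
    new_scores = np.array([20.0, 4.0, 14.0, 1.0])  # placeholder totals
    preds = model.predict(encoder.encode(new_texts))
    r, _ = pearsonr(preds, new_scores)
    print(f"Prospective accuracy: r = {r:.2f}  (paper reports .60-.79)")

Freezing the model before the prospective data are collected is what makes the reported r = .60-.79 a genuine out-of-sample estimate rather than a development-set figure inflated by overfitting.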

Source Journal
Assessment (PSYCHOLOGY, CLINICAL)
CiteScore: 8.90
Self-citation rate: 2.60%
Articles published: 86
Journal introduction: Assessment publishes articles in the domain of applied clinical assessment. The emphasis of this journal is on the publication of information relevant to the use of assessment measures, including test development, validation, and interpretation practices. The scope of the journal includes research that can inform assessment practices in mental health, forensic, medical, and other applied settings. Papers that focus on the assessment of cognitive and neuropsychological functioning, personality, and psychopathology are invited. Most papers published in Assessment report the results of original empirical research; however, integrative review articles and scholarly case studies will also be considered.