Digital Phenotyping for Detecting Depression Severity in a Large Payor-Provider System: Retrospective Study of Speech and Language Model Performance.

JMIR AI (impact factor 2.0). Publication date: 2025-06-19. DOI: 10.2196/69149
Bradley Karlin, Doug Henry, Ryan Anderson, Salvatore Cieri, Michael Aratow, Elizabeth Shriberg, Michelle Hoy
Volume 4, article e69149. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12223686/pdf/

Abstract

Background: There is considerable need to improve and increase the detection and measurement of depression. The use of speech as a digital biomarker of depression represents a considerable opportunity for transforming and accelerating depression identification and treatment; however, research to date has primarily consisted of small-sample feasibility or pilot studies incorporating highly controlled applications and settings. There has been limited examination of the technology in real-world use contexts.

Objective: This study evaluated the performance of a machine learning (ML) model examining both semantic and acoustic properties of speech in predicting depression across more than 2000 real-world interactions between health plan members and case managers.

Methods: A total of 2086 recordings of case management calls that included a verbally administered 9-item Patient Health Questionnaire (PHQ-9) survey were analyzed using the ML model after the portions of each recording containing the PHQ-9 survey were manually redacted. The recordings were divided into a Development Set (Dev Set; n=1336) and a Blind Set (n=671). Scores on the 8-item Patient Health Questionnaire (PHQ-8) were provided for the Dev Set for ML model refinement, while PHQ-8 scores for the Blind Set were withheld until after the ML model's depression severity output was reported.

Results: The Dev Set and the Blind Set were well matched for age (Dev Set: mean 53.7, SD 16.3 years; Blind Set: mean 51.7, SD 16.9 years), gender (Dev Set: 910/1336, 68.1% female; Blind Set: 462/671, 68.9% female), and depression severity (PHQ-8 score, Dev Set: mean 10.5, SD 6.1; Blind Set: mean 10.9, SD 6.0). The concordance correlation coefficient for the ML model was ρc=0.57 on the Dev Set and ρc=0.54 on the Blind Set, and the mean absolute error was 3.91 for the Dev Set and 4.06 for the Blind Set, demonstrating strong model performance. This performance was maintained when each set was divided into subgroups by age bracket (≤39, 40-64, and ≥65 years), biological sex, and the 4 categories of the Social Vulnerability Index (an index based on 16 social factors), with concordance correlation coefficients ranging from ρc=0.44 to ρc=0.61. Performance at PHQ-8 score cutoffs of 5, 10, 15, and 20, which separate the depression severity categories of none, mild, moderate, moderately severe, and severe (≥20), was expressed as the area under the receiver operating characteristic curve and varied between 0.79 and 0.83 in both the Dev and Blind Sets.
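
The three metrics reported above can be illustrated with a minimal sketch. It assumes Lin's standard definition of the concordance correlation coefficient and a rank-based area under the receiver operating characteristic curve; the abstract does not say how the authors computed either one. The function names and the toy scores are hypothetical and are not study data.

```python
import numpy as np


def concordance_cc(y_true, y_pred):
    """Lin's concordance correlation coefficient (assumed definition)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mean_t, mean_p = y_true.mean(), y_pred.mean()
    # Population covariance and variances (divide by n, not n - 1).
    cov = ((y_true - mean_t) * (y_pred - mean_p)).mean()
    return float(2 * cov / (y_true.var() + y_pred.var() + (mean_t - mean_p) ** 2))


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between true and predicted scores."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))


def auroc_at_cutoff(y_true, y_pred, cutoff):
    """AUROC for the binary label 'true score >= cutoff'.

    Computed as the probability that a randomly chosen positive case has a
    higher predicted score than a randomly chosen negative case, with ties
    counted as one half.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    pos = y_pred[y_true >= cutoff]
    neg = y_pred[y_true < cutoff]
    if len(pos) == 0 or len(neg) == 0:
        return float("nan")  # undefined when one class is empty
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)


if __name__ == "__main__":
    # Hypothetical PHQ-8 scores and model outputs, for illustration only.
    phq8 = [0, 3, 7, 12, 16, 22]
    predicted = [2, 4, 6, 10, 18, 19]
    print("CCC:", concordance_cc(phq8, predicted))
    print("MAE:", mean_absolute_error(phq8, predicted))
    for cutoff in (5, 10, 15, 20):
        print("AUROC at cutoff", cutoff, ":", auroc_at_cutoff(phq8, predicted, cutoff))
```

Unlike the Pearson correlation, the concordance correlation coefficient penalizes systematic offsets between predicted and true scores, so a model that tracks severity but is consistently too high or too low scores below 1.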

Conclusions: Overall, the findings suggest that speech may have significant potential for the detection and measurement of depression severity across a variety of ages, genders, and socioeconomic categories, which may enhance treatment, improve clinical decision-making, and enable truly personalized treatment recommendations.
