Automated Speech Analysis for Risk Detection of Depression, Anxiety, Insomnia, and Fatigue: Algorithm Development and Validation Study.

IF 5.8 | Tier 2 (Medicine) | Q1 HEALTH CARE SCIENCES & SERVICES
Rachid Riad, Martin Denais, Marc de Gennes, Adrien Lesage, Vincent Oustric, Xuan Nga Cao, Stéphane Mouchabac, Alexis Bourla
{"title":"Automated Speech Analysis for Risk Detection of Depression, Anxiety, Insomnia, and Fatigue: Algorithm Development and Validation Study.","authors":"Rachid Riad, Martin Denais, Marc de Gennes, Adrien Lesage, Vincent Oustric, Xuan Nga Cao, Stéphane Mouchabac, Alexis Bourla","doi":"10.2196/58572","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>While speech analysis holds promise for mental health assessment, research often focuses on single symptoms, despite symptom co-occurrences and interactions. In addition, predictive models in mental health do not properly assess the limitations of speech-based systems, such as uncertainty, or fairness for a safe clinical deployment.</p><p><strong>Objective: </strong>We investigated the predictive potential of mobile-collected speech data for detecting and estimating depression, anxiety, fatigue, and insomnia, focusing on other factors than mere accuracy, in the general population.</p><p><strong>Methods: </strong>We included 865 healthy adults and recorded their answers regarding their perceived mental and sleep states. We asked how they felt and if they had slept well lately. Clinically validated questionnaires measuring depression, anxiety, insomnia, and fatigue severity were also used. We developed a novel speech and machine learning pipeline involving voice activity detection, feature extraction, and model training. We automatically modeled speech with pretrained deep learning models that were pretrained on a large, open, and free database, and we selected the best one on the validation set. Based on the best speech modeling approach, clinical threshold detection, individual score prediction, model uncertainty estimation, and performance fairness across demographics (age, sex, and education) were evaluated. We used a train-validation-test split for all evaluations: to develop our models, select the best ones, and assess the generalizability of held-out data.</p><p><strong>Results: </strong>The best model was Whisper M with a max pooling and oversampling method. Our methods achieved good detection performance for all symptoms, depression (Patient Health Questionnaire-9: area under the curve [AUC]=0.76; F<sub>1</sub>-score=0.49 and Beck Depression Inventory: AUC=0.78; F<sub>1</sub>-score=0.65), anxiety (Generalized Anxiety Disorder 7-item scale: AUC=0.77; F<sub>1</sub>-score=0.50), insomnia (Athens Insomnia Scale: AUC=0.73; F<sub>1</sub>-score=0.62), and fatigue (Multidimensional Fatigue Inventory total score: AUC=0.68; F<sub>1</sub>-score=0.88). The system performed well when it needed to abstain from making predictions, as demonstrated by low abstention rates in depression detection with the Beck Depression Inventory and fatigue, with risk-coverage AUCs below 0.4. Individual symptom scores were accurately predicted (correlations were all significant with Pearson strengths between 0.31 and 0.49). Fairness analysis revealed that models were consistent for sex (average disparity ratio [DR] 0.86, SD 0.13), to a lesser extent for education level (average DR 0.47, SD 0.30), and worse for age groups (average DR 0.33, SD 0.30).</p><p><strong>Conclusions: </strong>This study demonstrates the potential of speech-based systems for multifaceted mental health assessment in the general population, not only for detecting clinical thresholds but also for estimating their severity. 
Addressing fairness and incorporating uncertainty estimation with selective classification are key contributions that can enhance the clinical utility and responsible implementation of such systems.</p>","PeriodicalId":16337,"journal":{"name":"Journal of Medical Internet Research","volume":" ","pages":"e58572"},"PeriodicalIF":5.8000,"publicationDate":"2024-10-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11565087/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Medical Internet Research","FirstCategoryId":"88","ListUrlMain":"https://doi.org/10.2196/58572","RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"HEALTH CARE SCIENCES & SERVICES","Score":null,"Total":0}
Citations: 0

Abstract

Background: While speech analysis holds promise for mental health assessment, research often focuses on single symptoms despite symptom co-occurrences and interactions. In addition, predictive models in mental health do not properly assess the limitations of speech-based systems, such as uncertainty or fairness, which are essential for safe clinical deployment.

Objective: We investigated the predictive potential of mobile-collected speech data for detecting and estimating depression, anxiety, fatigue, and insomnia in the general population, focusing on factors beyond mere accuracy.

Methods: We included 865 healthy adults and recorded their answers regarding their perceived mental and sleep states. We asked how they felt and whether they had slept well lately. Clinically validated questionnaires measuring depression, anxiety, insomnia, and fatigue severity were also used. We developed a novel speech and machine learning pipeline involving voice activity detection, feature extraction, and model training, and participants' speech was analyzed with this fully automated pipeline to capture voice variability. We modeled speech with deep learning models pretrained on a large, open, and free database and selected the best one on the validation set. Based on the best speech modeling approach, clinical threshold detection, individual score prediction, model uncertainty estimation, and performance fairness across demographics (age, sex, and education) were evaluated. We used a train-validation-test split for all evaluations: to develop our models, select the best ones, and assess generalizability on held-out data.
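As an illustration of how such a pipeline could be assembled, the sketch below combines a pretrained Whisper encoder (assuming "Whisper M" refers to the openly available openai/whisper-medium checkpoint), max pooling over time, random oversampling of the minority class, and a simple linear detector. The component choices and function names are assumptions for illustration only; the abstract does not specify the authors' exact implementation.

```python
# Minimal sketch of a speech-to-risk pipeline: Whisper embeddings, max pooling over
# time frames, oversampling, and a linear detector. All component choices here are
# assumptions, not the authors' exact implementation.
import numpy as np
import torch
from transformers import WhisperFeatureExtractor, WhisperModel
from sklearn.linear_model import LogisticRegression
from imblearn.over_sampling import RandomOverSampler

feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-medium")
encoder = WhisperModel.from_pretrained("openai/whisper-medium").get_encoder().eval()

def embed(waveform: np.ndarray, sampling_rate: int = 16000) -> np.ndarray:
    """Encode one voice-activity-filtered recording and max-pool over time frames."""
    inputs = feature_extractor(waveform, sampling_rate=sampling_rate, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(inputs.input_features).last_hidden_state  # (1, frames, 1024)
    return hidden.max(dim=1).values.squeeze(0).numpy()             # (1024,)

def train_threshold_detector(waveforms, labels):
    """Fit a binary detector for one clinical cutoff (e.g., PHQ-9 above threshold)."""
    X = np.stack([embed(w) for w in waveforms])
    X_res, y_res = RandomOverSampler(random_state=0).fit_resample(X, labels)
    return LogisticRegression(max_iter=1000).fit(X_res, y_res)
```

Model selection (for example, between pooling strategies or Whisper sizes) would then be carried out on the validation split, with final metrics reported on the held-out test split.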

Results: The best model was Whisper M with max pooling and an oversampling method. Our methods achieved good detection performance for all symptoms: depression (Patient Health Questionnaire-9: area under the curve [AUC]=0.76, F1-score=0.49; Beck Depression Inventory: AUC=0.78, F1-score=0.65), anxiety (Generalized Anxiety Disorder 7-item scale: AUC=0.77, F1-score=0.50), insomnia (Athens Insomnia Scale: AUC=0.73, F1-score=0.62), and fatigue (Multidimensional Fatigue Inventory total score: AUC=0.68, F1-score=0.88). The system also performed well when it needed to abstain from predicting on uncertain cases, as demonstrated for depression detection with the Beck Depression Inventory and for fatigue, with risk-coverage AUCs below 0.4. Individual symptom scores were predicted accurately (all correlations were significant, with Pearson strengths between 0.31 and 0.49). Fairness analysis revealed that models were consistent for sex (average disparity ratio [DR] 0.86, SD 0.13), less so for education level (average DR 0.47, SD 0.30), and worse for age groups (average DR 0.33, SD 0.30).
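The risk-coverage figures above summarize selective classification: the model may abstain on its least confident cases, and risk (error rate) is tracked as a function of coverage. Below is a minimal sketch of that computation, assuming confidence is the distance of the predicted probability from the decision threshold; the abstract does not state which uncertainty estimator was actually used.

```python
# Hedged sketch of a risk-coverage curve for selective classification. Confidence is
# assumed to be the distance of the predicted probability from 0.5; the paper's
# actual uncertainty estimator may differ.
import numpy as np

def risk_coverage_auc(y_true: np.ndarray, y_prob: np.ndarray) -> float:
    confidence = np.abs(y_prob - 0.5)                 # proxy for prediction certainty
    y_pred = (y_prob >= 0.5).astype(int)
    order = np.argsort(-confidence)                   # most confident cases retained first
    errors = (y_pred[order] != y_true[order]).astype(float)
    n = len(y_true)
    coverage = np.arange(1, n + 1) / n                # fraction of cases not abstained on
    risk = np.cumsum(errors) / np.arange(1, n + 1)    # error rate among retained cases
    return float(np.trapz(risk, coverage))            # area under the risk-coverage curve
```

A model whose confidence scores rank its errors last keeps risk low at small coverage, which drives this area down.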

Conclusions: This study demonstrates the potential of speech-based systems for multifaceted mental health assessment in the general population, not only for detecting clinical thresholds but also for estimating symptom severity. Addressing fairness and incorporating uncertainty estimation with selective classification are key contributions that can enhance the clinical utility and responsible implementation of such systems. This approach holds promise for more accurate and nuanced mental health assessments, benefiting both patients and clinicians.
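For the fairness analysis, a disparity ratio can be computed per demographic attribute by comparing the worst-performing subgroup with the best one. The sketch below uses that common definition, which is an assumption since the abstract does not give the exact formula used in the paper.

```python
# Hedged sketch of a disparity ratio (DR) across demographic subgroups: worst
# subgroup score divided by best subgroup score, so 1.0 means identical performance.
# The paper's exact definition is not given in the abstract.
from typing import Mapping

def disparity_ratio(score_by_group: Mapping[str, float]) -> float:
    scores = list(score_by_group.values())
    return min(scores) / max(scores)

# Illustrative numbers only (not results from the paper): per-subgroup AUCs by age group.
print(disparity_ratio({"18-30": 0.74, "31-50": 0.70, "51+": 0.62}))  # ≈ 0.84
```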

Source journal: Journal of Medical Internet Research
CiteScore: 14.40
Self-citation rate: 5.40%
Articles published: 654
Review time: 1 month
About the journal: The Journal of Medical Internet Research (JMIR) is a highly respected publication in the field of health informatics and health services. Founded in 1999, JMIR has been a pioneer in the field for over two decades. As a leader in the industry, the journal focuses on digital health, data science, health informatics, and emerging technologies for health, medicine, and biomedical research. It is recognized as a top publication in these disciplines, ranking in the first quartile (Q1) by impact factor. Notably, JMIR holds the prestigious position of being ranked #1 on Google Scholar within the "Medical Informatics" discipline.