Acoustic signatures of depression elicited by emotion-based and theme-based speech tasks

Qunxing Lin, Xiaohua Wu, Yueshiyuan Lei, Wanying Cheng, Shan Huang, Weijie Wang, Chong Li, Jiubo Zhao

BMJ Mental Health, published 2025-09-29. DOI: 10.1136/bmjment-2025-301858
Citations: 0
Abstract
BACKGROUND
Major depressive disorder (MDD) remains underdiagnosed worldwide, partly due to reliance on self-reported symptoms and clinician-administered interviews.
OBJECTIVE
This study examined whether a speech-based classification model using emotionally and thematically varied image-description tasks could effectively distinguish individuals with MDD from healthy controls.
METHODS
A total of 120 participants (59 with MDD, 61 healthy controls) completed four speech tasks: three emotionally valenced images (positive, neutral, negative) and one Thematic Apperception Test (TAT) stimulus. Speech responses were segmented, and 23 acoustic features were extracted per sample. Classification was performed using a long short-term memory (LSTM) neural network, with SHapley Additive exPlanations (SHAP) applied for feature interpretation. Four traditional machine learning models (support vector machine, decision tree, k-nearest neighbour, random forest) served as comparators. Within-subject variation in speech duration was assessed with repeated-measures analysis of variance (ANOVA).
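The classification setup described above can be illustrated with a minimal PyTorch sketch: an LSTM consumes a sequence of per-frame acoustic feature vectors (23 features per frame, matching the paper) and a linear head produces two-class logits (MDD vs. healthy control). The hidden size, layer count, and class names here are illustrative assumptions, not the architecture reported in the study.

```python
import torch
import torch.nn as nn

class SpeechLSTM(nn.Module):
    """Binary classifier over per-frame acoustic feature sequences.

    Hidden size (64) and single-layer depth are illustrative assumptions;
    only the 23-feature input dimension comes from the abstract.
    """
    def __init__(self, n_features: int = 23, hidden: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 2)  # logits: [MDD, healthy control]

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, n_features); h is the final hidden state
        _, (h, _) = self.lstm(x)          # h: (num_layers, batch, hidden)
        return self.head(h[-1])           # (batch, 2) class logits

model = SpeechLSTM()
dummy = torch.randn(4, 100, 23)           # 4 utterances, 100 frames each
logits = model(dummy)
print(logits.shape)                       # torch.Size([4, 2])
```

Using the final hidden state as the utterance summary is one common design choice for variable-length speech; it lets the recurrent state accumulate the temporal dynamics that the abstract credits for the LSTM's advantage over the static comparator models.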
FINDINGS
The LSTM model outperformed traditional classifiers, capturing temporal and dynamic speech patterns. The positive-valence image task achieved the highest accuracy (87.5%), followed by the negative-valence (85.0%), TAT (84.2%) and neutral-valence (81.7%) tasks. SHAP analysis highlighted task-specific contributions of pitch-related and spectral features. Significant differences in speech duration across tasks (p<0.01) indicated that affective valence influenced speech production.
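The within-subject duration comparison reported above corresponds to a one-way repeated-measures ANOVA with task as the within-subject factor. A hedged sketch using statsmodels' `AnovaRM` on synthetic data (the participant count, durations, and effect sizes below are fabricated for illustration only):

```python
import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM

rng = np.random.default_rng(0)
tasks = ["positive", "neutral", "negative", "TAT"]

# Synthetic balanced design: one speech duration per participant per task.
rows = []
for pid in range(12):                    # 12 synthetic participants (illustrative)
    base = rng.normal(30.0, 5.0)         # per-subject baseline duration in seconds
    for i, task in enumerate(tasks):
        rows.append({"pid": pid, "task": task,
                     "duration": base + 2.0 * i + rng.normal(0.0, 1.0)})
df = pd.DataFrame(rows)

# Repeated-measures ANOVA: task is the within-subject factor.
res = AnovaRM(df, depvar="duration", subject="pid", within=["task"]).fit()
print(res.anova_table)                   # F value, dfs, and p-value for "task"
```

`AnovaRM` requires a balanced design (exactly one observation per subject per cell), which the loop above guarantees; with a real dataset, unbalanced cells would need aggregation first via the `aggregate_func` argument.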
CONCLUSIONS
Emotionally enriched and thematically ambiguous tasks enhanced automated MDD detection, with positive-valence stimuli providing the greatest discriminative power. SHAP interpretation underscored the importance of tailoring models to different speech inputs.
CLINICAL IMPLICATIONS
Speech-based models incorporating emotionally evocative and projective stimuli offer a scalable, non-invasive approach for early depression screening. Their reliance on natural speech supports cross-cultural application and reduces stigma and literacy barriers. Broader validation is needed to facilitate integration into routine screening and monitoring.