Acoustic signatures of depression elicited by emotion-based and theme-based speech tasks.
Authors: Qunxing Lin, Xiaohua Wu, Yueshiyuan Lei, Wanying Cheng, Shan Huang, Weijie Wang, Chong Li, Jiubo Zhao
Journal: BMJ Mental Health (published 2025-09-29)
DOI: 10.1136/bmjment-2025-301858 (https://doi.org/10.1136/bmjment-2025-301858)
BACKGROUND
Major depressive disorder (MDD) remains underdiagnosed worldwide, partly due to reliance on self-reported symptoms and clinician-administered interviews.
OBJECTIVE
This study examined whether a speech-based classification model using emotionally and thematically varied image-description tasks could effectively distinguish individuals with MDD from healthy controls.
METHODS
A total of 120 participants (59 with MDD, 61 healthy controls) completed four speech tasks: describing three emotionally valenced images (positive, neutral, negative) and one Thematic Apperception Test (TAT) stimulus. Speech responses were segmented, and 23 acoustic features were extracted per sample. Classification was performed using a long short-term memory (LSTM) neural network, with SHapley Additive exPlanations (SHAP) applied for feature interpretation. Four traditional machine learning models (support vector machine, decision tree, k-nearest neighbour, random forest) served as comparators. Within-subject variation in speech duration across the four tasks was assessed with repeated-measures analysis of variance (ANOVA).
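The repeated-measures analysis of speech duration can be illustrated with a minimal sketch. It assumes a one-way within-subject design (task as the only factor) with complete data, which the abstract does not confirm. The function name `rm_anova_oneway` and the duration values are hypothetical and are not study data. The sketch returns only the F statistic and its degrees of freedom; a p-value would need an F-distribution routine such as `scipy.stats.f.sf`, and no sphericity correction is applied.

```python
def rm_anova_oneway(data):
    """One-way repeated-measures ANOVA (hypothetical helper, not the authors' code).

    data: list of rows, one row per participant, one value per task.
    Returns (F statistic, df for the task effect, df for the error term).
    """
    n = len(data)                      # number of participants
    k = len(data[0]) if n else 0       # number of tasks (within-subject levels)
    if n < 2 or k < 2 or any(len(row) != k for row in data):
        raise ValueError("need >= 2 participants and >= 2 tasks, with no missing cells")

    grand_mean = sum(sum(row) for row in data) / (n * k)
    task_means = [sum(row[j] for row in data) / n for j in range(k)]
    subject_means = [sum(row) / k for row in data]

    # Partition the total sum of squares into task, subject and residual parts.
    ss_total = sum((x - grand_mean) ** 2 for row in data for x in row)
    ss_task = n * sum((m - grand_mean) ** 2 for m in task_means)
    ss_subject = k * sum((m - grand_mean) ** 2 for m in subject_means)
    ss_error = ss_total - ss_task - ss_subject
    if ss_error <= 0:
        raise ValueError("no residual variance; F is undefined")

    df_task = k - 1
    df_error = (k - 1) * (n - 1)
    f_stat = (ss_task / df_task) / (ss_error / df_error)
    return f_stat, df_task, df_error


if __name__ == "__main__":
    # Made-up speech durations in seconds: rows are participants,
    # columns are positive, neutral, negative and TAT tasks.
    durations = [
        [42.0, 35.5, 39.0, 51.0],
        [38.5, 30.0, 36.5, 47.5],
        [55.0, 44.0, 49.5, 60.0],
        [47.5, 41.0, 40.0, 58.5],
    ]
    f_stat, df_task, df_error = rm_anova_oneway(durations)
    print(f"F({df_task}, {df_error}) = {f_stat:.2f}")
```

Removing the between-subject sum of squares from the error term is what distinguishes this from an ordinary one-way ANOVA: each participant serves as their own control across the four tasks.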
FINDINGS
The LSTM model outperformed traditional classifiers, capturing temporal and dynamic speech patterns. The positive-valence image task achieved the highest accuracy (87.5%), followed by the negative-valence (85.0%), TAT (84.2%) and neutral-valence (81.7%) tasks. SHAP analysis highlighted task-specific contributions of pitch-related and spectral features. Significant differences in speech duration across tasks (p<0.01) indicated that affective valence influenced speech production.
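As a rough consistency check on the reported accuracies, the sketch below asks which whole-number counts of correct classifications would round to each percentage. It assumes accuracy was computed once per participant over all 120 participants; the abstract does not state the evaluation unit (participant or speech segment) or the validation scheme, so this is an illustration only. The function name `implied_correct_counts` is hypothetical.

```python
def implied_correct_counts(accuracy_pct, n, decimals=1):
    """Return every count c in 0..n whose accuracy c/n matches accuracy_pct
    once rounded to the given number of decimals (hypothetical helper)."""
    half_step = 0.5 * 10 ** -decimals  # half of the last reported digit
    return [c for c in range(n + 1) if abs(100 * c / n - accuracy_pct) < half_step]


if __name__ == "__main__":
    n_participants = 120  # 59 with MDD + 61 healthy controls, from the abstract
    reported = {          # accuracies as given in the abstract
        "positive": 87.5,
        "negative": 85.0,
        "TAT": 84.2,
        "neutral": 81.7,
    }
    for task, acc in reported.items():
        print(task, acc, implied_correct_counts(acc, n_participants))
```

Under that assumption, the four figures correspond to 105, 102, 101 and 98 correct classifications out of 120, so the gap between the best and worst task would be seven participants.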
CONCLUSIONS
Emotionally enriched and thematically ambiguous tasks enhanced automated MDD detection, with positive-valence stimuli providing the greatest discriminative power. SHAP interpretation underscored the importance of tailoring models to different speech inputs.
CLINICAL IMPLICATIONS
Speech-based models incorporating emotionally evocative and projective stimuli offer a scalable, non-invasive approach for early depression screening. Their reliance on natural speech supports cross-cultural application and reduces stigma and literacy barriers. Broader validation is needed to facilitate integration into routine screening and monitoring.