S Regondi, F Roncone, V Colombo, R Pugliese, E Bagli, G Russo, A Panella, M Radavelli, S Bolognini
Journal of Voice. Published 2025-09-25. DOI: 10.1016/j.jvoice.2025.09.012 (https://doi.org/10.1016/j.jvoice.2025.09.012)
Voice of Mind, a Deep Learning Model for Depression and Anxiety Assessment From Acoustic and Lexical Vocal Biomarkers.
Objective: To develop a deep learning model that assesses anxiety and depression from acoustic and lexical biomarkers, and that can analyze Italian psychotherapy recordings and classify three distinct conditions: depression, anxiety, and no pathology.
Method: Five patients diagnosed with either Major Depressive Disorder or Generalized Anxiety Disorder were selected from psychotherapy sessions conducted at RAM Psyche. A total of seven audio recordings were manually analyzed by a clinical psychologist using the DASS-21 scale, resulting in over 1000 audio segments labeled for psychopathological content. From these recordings, acoustic features and lexical markers were extracted. These features were processed through a hybrid architecture combining a Convolutional Neural Network for Mel spectrogram analysis and a Multi-Layer Perceptron for integrating lexical and acoustic inputs. Three model variants (VOM 1.1, 1.2, and 1.3) were trained and evaluated using two custom datasets (DVOM2, DVOM3), including both internal patient audio and external neutral voices.
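The hybrid data flow described in the Method can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not give layer sizes, the fusion strategy, or any training details, so the single 3x3 convolution layer, the concatenation of the CNN embedding with the other features, the dimensions, and the names `conv2d_valid`, `forward`, and `init_params` are all assumptions made for illustration. The weights are random and untrained, so the predicted label carries no meaning.

```python
import numpy as np

CLASSES = ("depression", "anxiety", "no pathology")

def conv2d_valid(x, kernels):
    """Valid 2-D cross-correlation. x: (H, W); kernels: (K, kh, kw)."""
    k, kh, kw = kernels.shape
    out_h, out_w = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.zeros((k, out_h, out_w))
    for i in range(kh):
        for j in range(kw):
            # Contribution of kernel tap (i, j) to every output position.
            out += kernels[:, i, j][:, None, None] * x[i:i + out_h, j:j + out_w]
    return out

def softmax(z):
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()

def init_params(n_kernels, n_tabular, n_hidden, rng):
    """Random, untrained weights; all sizes are illustrative assumptions."""
    return {
        "kernels": rng.normal(0, 0.1, (n_kernels, 3, 3)),
        "w1": rng.normal(0, 0.1, (n_hidden, n_kernels + n_tabular)),
        "b1": np.zeros(n_hidden),
        "w2": rng.normal(0, 0.1, (len(CLASSES), n_hidden)),
        "b2": np.zeros(len(CLASSES)),
    }

def forward(mel, tabular, params):
    # CNN branch: convolution, ReLU, then global average pooling.
    fmap = np.maximum(conv2d_valid(mel, params["kernels"]), 0.0)
    cnn_embedding = fmap.mean(axis=(1, 2))  # shape (n_kernels,)
    # Assumed fusion: concatenate the spectrogram embedding with the
    # acoustic and lexical feature vector, then apply a small MLP.
    fused = np.concatenate([cnn_embedding, tabular])
    hidden = np.maximum(params["w1"] @ fused + params["b1"], 0.0)
    return softmax(params["w2"] @ hidden + params["b2"])

rng = np.random.default_rng(0)
params = init_params(n_kernels=4, n_tabular=6, n_hidden=8, rng=rng)
mel = rng.normal(size=(40, 100))  # placeholder: 40 Mel bands x 100 frames
tabular = rng.normal(size=6)      # placeholder acoustic + lexical features
probs = forward(mel, tabular, params)
predicted = CLASSES[int(np.argmax(probs))]
```

Here `probs` holds one probability per class for a single segment. A real system would use a deep learning framework, several convolutional layers, and weights learned from the labeled segments.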
Results: The model classified segments into depression, anxiety, and no pathology with promising results. Feature importance analysis revealed that prosodic cues such as lower pitch, reduced intensity, and increased pauses were highly predictive of depression, while lexical richness and adverb usage were associated with both disorders. Among the model variants, VOM 1.1 showed balanced performance across all three classes, particularly excelling in detecting depression and no pathology. In contrast, VOM 1.2 prioritized depression and anxiety detection, occasionally misclassifying ambiguous cases as symptomatic, suggesting a heightened sensitivity to subtle pathological cues. VOM 1.3, while maintaining strong classification performance, demonstrated improved robustness on external neutral voices.
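As a rough illustration of the two lexical markers named in the Results, the sketch below computes a type-token ratio as one common proxy for lexical richness, plus a crude adverb ratio. The abstract does not say how either marker was measured, so both definitions are assumptions. The "-mente" suffix rule in particular is a hypothetical stand-in for a proper Italian part-of-speech tagger: it misses adverbs such as "sempre" and can wrongly count adjectives that end in the same letters. The example sentence is invented.

```python
import re

def tokenize(text):
    """Lowercase word tokens; the pattern keeps accented letters."""
    return re.findall(r"[^\W\d_]+", text.lower())

def type_token_ratio(tokens):
    """Distinct words divided by total words (0.0 for empty input)."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def mente_adverb_ratio(tokens):
    """Share of tokens ending in '-mente' (crude heuristic, see lead-in)."""
    if not tokens:
        return 0.0
    # len > 5 excludes the noun "mente" itself.
    hits = sum(1 for t in tokens if t.endswith("mente") and len(t) > 5)
    return hits / len(tokens)

# Invented example segment, not taken from the study data.
segment = "mi sento sempre stanco e mi sveglio lentamente"
tokens = tokenize(segment)
ttr = type_token_ratio(tokens)             # 7 distinct words / 8 tokens
adverb_ratio = mente_adverb_ratio(tokens)  # 1 suffix match / 8 tokens
```

The type-token ratio is sensitive to segment length, so comparing segments of very different durations would call for a length-normalized variant.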
Conclusions: The Voice of Mind model demonstrates the feasibility of using speech data to support mental health diagnostics. Its capacity to distinguish between depression and anxiety, while maintaining generalization across nonpathological voices, suggests its potential as a clinical decision-support tool.
About the journal:
The Journal of Voice is widely regarded as the world's premier journal for voice medicine and research. This peer-reviewed publication is listed in Index Medicus and is indexed by the Institute for Scientific Information. The journal contains articles written by experts throughout the world on all topics in voice sciences, voice medicine and surgery, and speech-language pathologists' management of voice-related problems. The journal includes clinical articles, clinical research, and laboratory research. Members of the Foundation receive the journal as a benefit of membership.