Speechformer-CTC：利用语音时态分类对抑郁检测进行序列建模

IF 3 3区计算机科学 Q2 ACOUSTICS

Speech Communication Pub Date : 2024-07-18 DOI:10.1016/j.specom.2024.103106

Jinhan Wang , Vijay Ravi , Jonathan Flint , Abeer Alwan

{"title":"Speechformer-CTC：利用语音时态分类对抑郁检测进行序列建模","authors":"Jinhan Wang , Vijay Ravi , Jonathan Flint , Abeer Alwan","doi":"10.1016/j.specom.2024.103106","DOIUrl":null,"url":null,"abstract":"<div><p>Speech-based automatic depression detection systems have been extensively explored over the past few years. Typically, each speaker is assigned a single label (Depressive or Non-depressive), and most approaches formulate depression detection as a speech classification task without explicitly considering the non-uniformly distributed depression pattern within segments, leading to low generalizability and robustness across different scenarios. However, depression corpora do not provide fine-grained labels (at the phoneme or word level) which makes the dynamic depression pattern in speech segments harder to track using conventional frameworks. To address this, we propose a novel framework, Speechformer-CTC, to model non-uniformly distributed depression characteristics within segments using a Connectionist Temporal Classification (CTC) objective function without the necessity of input–output alignment. Two novel CTC-label generation policies, namely the Expectation-One-Hot and the HuBERT policies, are proposed and incorporated in objectives on various granularities. Additionally, experiments using Automatic Speech Recognition (ASR) features are conducted to demonstrate the compatibility of the proposed method with content-based features. Our results show that the performance of depression detection, in terms of Macro F1-score, is improved on both DAIC-WOZ (English) and CONVERGE (Mandarin) datasets. On the DAIC-WOZ dataset, the system with HuBERT ASR features and a CTC objective optimized using HuBERT policy for label generation achieves 83.15% F1-score, which is close to state-of-the-art without the need for phoneme-level transcription or data augmentation. On the CONVERGE dataset, using Whisper features with the HuBERT policy improves the F1-score by 9.82% on CONVERGE1 (in-domain test set) and 18.47% on CONVERGE2 (out-of-domain test set). These findings show that depression detection can benefit from modeling non-uniformly distributed depression patterns and the proposed framework can be potentially used to determine significant depressive regions in speech utterances.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"163 ","pages":"Article 103106"},"PeriodicalIF":3.0000,"publicationDate":"2024-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0167639324000785/pdfft?md5=afe02da612b1e415b45579997ae4074e&pid=1-s2.0-S0167639324000785-main.pdf","citationCount":"0","resultStr":"{\"title\":\"Speechformer-CTC: Sequential modeling of depression detection with speech temporal classification\",\"authors\":\"Jinhan Wang , Vijay Ravi , Jonathan Flint , Abeer Alwan\",\"doi\":\"10.1016/j.specom.2024.103106\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><p>Speech-based automatic depression detection systems have been extensively explored over the past few years. Typically, each speaker is assigned a single label (Depressive or Non-depressive), and most approaches formulate depression detection as a speech classification task without explicitly considering the non-uniformly distributed depression pattern within segments, leading to low generalizability and robustness across different scenarios. However, depression corpora do not provide fine-grained labels (at the phoneme or word level) which makes the dynamic depression pattern in speech segments harder to track using conventional frameworks. To address this, we propose a novel framework, Speechformer-CTC, to model non-uniformly distributed depression characteristics within segments using a Connectionist Temporal Classification (CTC) objective function without the necessity of input–output alignment. Two novel CTC-label generation policies, namely the Expectation-One-Hot and the HuBERT policies, are proposed and incorporated in objectives on various granularities. Additionally, experiments using Automatic Speech Recognition (ASR) features are conducted to demonstrate the compatibility of the proposed method with content-based features. Our results show that the performance of depression detection, in terms of Macro F1-score, is improved on both DAIC-WOZ (English) and CONVERGE (Mandarin) datasets. On the DAIC-WOZ dataset, the system with HuBERT ASR features and a CTC objective optimized using HuBERT policy for label generation achieves 83.15% F1-score, which is close to state-of-the-art without the need for phoneme-level transcription or data augmentation. On the CONVERGE dataset, using Whisper features with the HuBERT policy improves the F1-score by 9.82% on CONVERGE1 (in-domain test set) and 18.47% on CONVERGE2 (out-of-domain test set). These findings show that depression detection can benefit from modeling non-uniformly distributed depression patterns and the proposed framework can be potentially used to determine significant depressive regions in speech utterances.</p></div>\",\"PeriodicalId\":49485,\"journal\":{\"name\":\"Speech Communication\",\"volume\":\"163 \",\"pages\":\"Article 103106\"},\"PeriodicalIF\":3.0000,\"publicationDate\":\"2024-07-18\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.sciencedirect.com/science/article/pii/S0167639324000785/pdfft?md5=afe02da612b1e415b45579997ae4074e&pid=1-s2.0-S0167639324000785-main.pdf\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Speech Communication\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0167639324000785\",\"RegionNum\":3,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"ACOUSTICS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Speech Communication","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0167639324000785","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"ACOUSTICS","Score":null,"Total":0}

引用次数: 0

摘要

基于语音的抑郁自动检测系统在过去几年中得到了广泛的探索。通常情况下，每个说话者都会被赋予一个单一的标签（抑郁或非抑郁），而且大多数方法都将抑郁检测作为一项语音分类任务，而没有明确考虑片段内非均匀分布的抑郁模式，从而导致在不同场景下的通用性和鲁棒性较低。然而，抑郁语料库不提供细粒度标签（音素或单词级别），这使得使用传统框架跟踪语音片段中的动态抑郁模式变得更加困难。为了解决这个问题，我们提出了一个新颖的框架 Speechformer-CTC，利用 Connectionist Temporal Classification (CTC) 目标函数对片段内非均匀分布的抑郁特征进行建模，而无需输入输出对齐。提出了两种新颖的 CTC 标签生成策略，即期望一热策略和 HuBERT 策略，并将其纳入不同粒度的目标中。此外，还使用自动语音识别（ASR）特征进行了实验，以证明所提方法与基于内容的特征的兼容性。我们的结果表明，在 DAIC-WOZ（英语）和 CONVERGE（普通话）数据集上，抑郁检测的性能（宏观 F1 分数）都得到了提高。在 DAIC-WOZ 数据集上，采用 HuBERT ASR 特征和使用 HuBERT 策略优化标签生成的 CTC 目标的系统取得了 83.15% 的 F1 分数，接近最先进水平，无需进行音素级转录或数据增强。在 CONVERGE 数据集上，使用 Whisper 特征和 HuBERT 策略可将 CONVERGE1（域内测试集）的 F1 分数提高 9.82%，将 CONVERGE2（域外测试集）的 F1 分数提高 18.47%。这些研究结果表明，抑郁检测可以从非均匀分布的抑郁模式建模中获益，所提出的框架可用于确定语音语篇中的重要抑郁区域。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Speechformer-CTC: Sequential modeling of depression detection with speech temporal classification

Speech-based automatic depression detection systems have been extensively explored over the past few years. Typically, each speaker is assigned a single label (Depressive or Non-depressive), and most approaches formulate depression detection as a speech classification task without explicitly considering the non-uniformly distributed depression pattern within segments, leading to low generalizability and robustness across different scenarios. However, depression corpora do not provide fine-grained labels (at the phoneme or word level) which makes the dynamic depression pattern in speech segments harder to track using conventional frameworks. To address this, we propose a novel framework, Speechformer-CTC, to model non-uniformly distributed depression characteristics within segments using a Connectionist Temporal Classification (CTC) objective function without the necessity of input–output alignment. Two novel CTC-label generation policies, namely the Expectation-One-Hot and the HuBERT policies, are proposed and incorporated in objectives on various granularities. Additionally, experiments using Automatic Speech Recognition (ASR) features are conducted to demonstrate the compatibility of the proposed method with content-based features. Our results show that the performance of depression detection, in terms of Macro F1-score, is improved on both DAIC-WOZ (English) and CONVERGE (Mandarin) datasets. On the DAIC-WOZ dataset, the system with HuBERT ASR features and a CTC objective optimized using HuBERT policy for label generation achieves 83.15% F1-score, which is close to state-of-the-art without the need for phoneme-level transcription or data augmentation. On the CONVERGE dataset, using Whisper features with the HuBERT policy improves the F1-score by 9.82% on CONVERGE1 (in-domain test set) and 18.47% on CONVERGE2 (out-of-domain test set). These findings show that depression detection can benefit from modeling non-uniformly distributed depression patterns and the proposed framework can be potentially used to determine significant depressive regions in speech utterances.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Speech Communication 工程技术-计算机：跨学科应用

CiteScore

6.80

自引率

6.20%

发文量

审稿时长

19.2 weeks

期刊介绍： Speech Communication is an interdisciplinary journal whose primary objective is to fulfil the need for the rapid dissemination and thorough discussion of basic and applied research results. The journal''s primary objectives are: • to present a forum for the advancement of human and human-machine speech communication science; • to stimulate cross-fertilization between different fields of this domain; • to contribute towards the rapid and wide diffusion of scientifically sound contributions in this domain.