Speech Communication: Latest Publications

Progressive channel fusion for more efficient TDNN on speaker verification
IF 2.4 · CAS Zone 3 · Computer Science
Speech Communication Pub Date: 2024-07-23 DOI: 10.1016/j.specom.2024.103105
Zhenduo Zhao, Zhuo Li, Wenchao Wang, Ji Xu
{"title":"Progressive channel fusion for more efficient TDNN on speaker verification","authors":"Zhenduo Zhao ,&nbsp;Zhuo Li ,&nbsp;Wenchao Wang ,&nbsp;Ji Xu","doi":"10.1016/j.specom.2024.103105","DOIUrl":"10.1016/j.specom.2024.103105","url":null,"abstract":"<div><p>ECAPA-TDNN is one of the most popular TDNNs for speaker verification. While most of the updates pay attention to building precisely designed auxiliary modules, the depth-first principle has shown promising performance recently. However, empirical experiments show that one-dimensional convolution (Conv1D) based TDNNs suffer from performance degradation by simply adding massive vanilla basic blocks. Note that Conv1D naturally has a global receptive field (RF) on the feature dimension, progressive channel fusion (PCF) is proposed to alleviate this issue by introducing group convolution to build local RF and fusing the subbands progressively. Instead of reducing the group number in convolution layers used in the previous work, a novel channel permutation strategy is introduced to build information flow between groups so that all basic blocks in the model keep consistent parameter efficiency. The information leakage from lower-frequency bands to higher ones caused by Res2Block is simultaneously solved by introducing group-in-group convolution and using channel permutation. Besides the PCF strategy, redundant connections are removed for a more concise model architecture. The experiments on VoxCeleb and CnCeleb achieve state-of-the-art (SOTA) performance with an average relative improvement of 12.3% on EER and 13.2% on minDCF (0.01), validating the effectiveness of the proposed model.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"163 ","pages":"Article 103105"},"PeriodicalIF":2.4,"publicationDate":"2024-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141960884","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
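The channel-permutation idea described in the abstract can be illustrated with a short PyTorch sketch: a grouped Conv1D gives each group a local receptive field over a subset of channels (subbands), and a permutation then interleaves channels so information flows between groups in the next block. This is only a minimal sketch of the general technique under assumed tensor shapes and invented module names, not the authors' ECAPA-TDNN/PCF implementation.

```python
import torch
import torch.nn as nn

def channel_permute(x: torch.Tensor, groups: int) -> torch.Tensor:
    """Interleave channels across groups so the next grouped Conv1D sees all subbands.

    x: (batch, channels, time); channels must be divisible by groups.
    """
    b, c, t = x.shape
    # (B, groups, c_per_group, T) -> swap group/channel axes -> flatten back to (B, C, T)
    return x.view(b, groups, c // groups, t).transpose(1, 2).reshape(b, c, t)

class GroupedConvBlock(nn.Module):
    """Grouped Conv1D (local receptive field per subband group) followed by permutation."""
    def __init__(self, channels: int, groups: int = 4, kernel_size: int = 3):
        super().__init__()
        self.groups = groups
        self.conv = nn.Conv1d(channels, channels, kernel_size,
                              padding=kernel_size // 2, groups=groups)
        self.act = nn.ReLU()

    def forward(self, x):
        return channel_permute(self.act(self.conv(x)), self.groups)

if __name__ == "__main__":
    feats = torch.randn(2, 64, 200)      # (batch, channels, frames), dummy features
    block = GroupedConvBlock(64, groups=4)
    print(block(feats).shape)            # torch.Size([2, 64, 200])
```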
Decoupled structure for improved adaptability of end-to-end models
IF 2.4 · CAS Zone 3 · Computer Science
Speech Communication Pub Date: 2024-07-23 DOI: 10.1016/j.specom.2024.103109
Keqi Deng, Philip C. Woodland
{"title":"Decoupled structure for improved adaptability of end-to-end models","authors":"Keqi Deng,&nbsp;Philip C. Woodland","doi":"10.1016/j.specom.2024.103109","DOIUrl":"10.1016/j.specom.2024.103109","url":null,"abstract":"<div><p>Although end-to-end (E2E) trainable automatic speech recognition (ASR) has shown great success by jointly learning acoustic and linguistic information, it still suffers from the effect of domain shifts, thus limiting potential applications. The E2E ASR model implicitly learns an internal language model (LM) which characterises the training distribution of the source domain, and the E2E trainable nature makes the internal LM difficult to adapt to the target domain with text-only data. To solve this problem, this paper proposes decoupled structures for attention-based encoder–decoder (Decoupled-AED) and neural transducer (Decoupled-Transducer) models, which can achieve flexible domain adaptation in both offline and online scenarios while maintaining robust intra-domain performance. To this end, the acoustic and linguistic parts of the E2E model decoder (or prediction network) are decoupled, making the linguistic component (i.e. internal LM) replaceable. When encountering a domain shift, the internal LM can be directly replaced during inference by a target-domain LM, without re-training or using domain-specific paired speech-text data. Experiments for E2E ASR models trained on the LibriSpeech-100h corpus showed that the proposed decoupled structure gave 15.1% and 17.2% relative word error rate reductions on the TED-LIUM 2 and AESRC2020 corpora while still maintaining performance on intra-domain data. It is also shown that the decoupled structure can be used to boost cross-domain speech translation quality while retaining the intra-domain performance.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"163 ","pages":"Article 103109"},"PeriodicalIF":2.4,"publicationDate":"2024-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0167639324000803/pdfft?md5=7e35ebdc40ecd26754dcc103e392268c&pid=1-s2.0-S0167639324000803-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141943677","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
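To make the decoupling idea concrete, here is a toy PyTorch sketch in which the decoder's linguistic branch is an ordinary sub-module that can be swapped for a target-domain LM at inference time. It is a schematic illustration with invented module names and a deliberately simplified architecture, not the Decoupled-AED or Decoupled-Transducer models from the paper.

```python
import torch
import torch.nn as nn

class DecoupledDecoder(nn.Module):
    """Toy decoder whose linguistic part (internal LM) is a swappable sub-module.

    Hypothetical structure for illustration: an acoustic branch conditioned on an
    encoder summary and a linguistic branch over the token history are combined.
    Replacing `self.internal_lm` mimics swapping in a target-domain LM without retraining.
    """
    def __init__(self, vocab: int, dim: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.acoustic = nn.GRU(dim, dim, batch_first=True)     # stand-in acoustic branch
        self.internal_lm = nn.GRU(dim, dim, batch_first=True)  # replaceable linguistic branch
        self.out = nn.Linear(2 * dim, vocab)

    def forward(self, prev_tokens, enc_summary):
        e = self.embed(prev_tokens)                         # (B, U, D)
        a, _ = self.acoustic(e + enc_summary.unsqueeze(1))  # acoustic-conditioned states
        l, _ = self.internal_lm(e)                          # text-only linguistic states
        return self.out(torch.cat([a, l], dim=-1))          # (B, U, vocab)

    def swap_internal_lm(self, target_lm: nn.Module):
        """Domain adaptation with text-only data: drop in a target-domain LM."""
        self.internal_lm = target_lm

dec = DecoupledDecoder(vocab=1000)
dec.swap_internal_lm(nn.GRU(256, 256, batch_first=True))  # placeholder for a target-domain LM
logits = dec(torch.randint(0, 1000, (2, 7)), torch.randn(2, 256))
print(logits.shape)                                        # torch.Size([2, 7, 1000])
```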
Speechformer-CTC: Sequential modeling of depression detection with speech temporal classification
IF 2.4 · CAS Zone 3 · Computer Science
Speech Communication Pub Date: 2024-07-18 DOI: 10.1016/j.specom.2024.103106
Jinhan Wang, Vijay Ravi, Jonathan Flint, Abeer Alwan
{"title":"Speechformer-CTC: Sequential modeling of depression detection with speech temporal classification","authors":"Jinhan Wang ,&nbsp;Vijay Ravi ,&nbsp;Jonathan Flint ,&nbsp;Abeer Alwan","doi":"10.1016/j.specom.2024.103106","DOIUrl":"10.1016/j.specom.2024.103106","url":null,"abstract":"<div><p>Speech-based automatic depression detection systems have been extensively explored over the past few years. Typically, each speaker is assigned a single label (Depressive or Non-depressive), and most approaches formulate depression detection as a speech classification task without explicitly considering the non-uniformly distributed depression pattern within segments, leading to low generalizability and robustness across different scenarios. However, depression corpora do not provide fine-grained labels (at the phoneme or word level) which makes the dynamic depression pattern in speech segments harder to track using conventional frameworks. To address this, we propose a novel framework, Speechformer-CTC, to model non-uniformly distributed depression characteristics within segments using a Connectionist Temporal Classification (CTC) objective function without the necessity of input–output alignment. Two novel CTC-label generation policies, namely the Expectation-One-Hot and the HuBERT policies, are proposed and incorporated in objectives on various granularities. Additionally, experiments using Automatic Speech Recognition (ASR) features are conducted to demonstrate the compatibility of the proposed method with content-based features. Our results show that the performance of depression detection, in terms of Macro F1-score, is improved on both DAIC-WOZ (English) and CONVERGE (Mandarin) datasets. On the DAIC-WOZ dataset, the system with HuBERT ASR features and a CTC objective optimized using HuBERT policy for label generation achieves 83.15% F1-score, which is close to state-of-the-art without the need for phoneme-level transcription or data augmentation. On the CONVERGE dataset, using Whisper features with the HuBERT policy improves the F1-score by 9.82% on CONVERGE1 (in-domain test set) and 18.47% on CONVERGE2 (out-of-domain test set). These findings show that depression detection can benefit from modeling non-uniformly distributed depression patterns and the proposed framework can be potentially used to determine significant depressive regions in speech utterances.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"163 ","pages":"Article 103106"},"PeriodicalIF":2.4,"publicationDate":"2024-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0167639324000785/pdfft?md5=afe02da612b1e415b45579997ae4074e&pid=1-s2.0-S0167639324000785-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141842447","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
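The core training mechanism, frame-level posteriors trained against a short segment-level target with a CTC objective, can be sketched with PyTorch's built-in CTC loss. The label-generation step below is a crude stand-in (repeating the segment label a fixed number of times); the paper's Expectation-One-Hot and HuBERT policies are more elaborate, and all sizes and labels here are invented for illustration.

```python
import torch
import torch.nn as nn

# Hypothetical setup: two emitted classes (1 = non-depressive, 2 = depressive) plus CTC blank = 0.
ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)

batch, frames, n_classes = 4, 120, 3
log_probs = torch.randn(frames, batch, n_classes).log_softmax(-1)   # (T, B, C) frame posteriors

# Crude label-generation policy: repeat the segment-level label so CTC can distribute it
# over the frames (only a stand-in for the Expectation-One-Hot / HuBERT policies).
segment_labels = torch.tensor([1, 2, 2, 1])                 # one label per speech segment
repeats = 5
targets = segment_labels.unsqueeze(1).repeat(1, repeats)    # (B, repeats)
input_lengths = torch.full((batch,), frames, dtype=torch.long)
target_lengths = torch.full((batch,), repeats, dtype=torch.long)

loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
print(float(loss))
```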
Whisper-SV: Adapting Whisper for low-data-resource speaker verification
IF 2.4 · CAS Zone 3 · Computer Science
Speech Communication Pub Date: 2024-07-14 DOI: 10.1016/j.specom.2024.103103
Li Zhang, Ning Jiang, Qing Wang, Yue Li, Quan Lu, Lei Xie
{"title":"Whisper-SV: Adapting Whisper for low-data-resource speaker verification","authors":"Li Zhang ,&nbsp;Ning Jiang ,&nbsp;Qing Wang ,&nbsp;Yue Li ,&nbsp;Quan Lu ,&nbsp;Lei Xie","doi":"10.1016/j.specom.2024.103103","DOIUrl":"10.1016/j.specom.2024.103103","url":null,"abstract":"<div><p>Trained on 680,000 h of massive speech data, Whisper is a multitasking, multilingual speech foundation model demonstrating superior performance in automatic speech recognition, translation, and language identification. However, its applicability in speaker verification (SV) tasks remains unexplored, particularly in low-data-resource scenarios where labeled speaker data in specific domains are limited. To fill this gap, we propose a lightweight adaptor framework to boost SV with Whisper, namely Whisper-SV. Given that Whisper is not specifically optimized for SV tasks, we introduce a representation selection module to quantify the speaker-specific characteristics contained in each layer of Whisper and select the top-k layers with prominent discriminative speaker features. To aggregate pivotal speaker-related features while diminishing non-speaker redundancies across the selected top-k distinct layers of Whisper, we design a multi-layer aggregation module in Whisper-SV to integrate multi-layer representations into a singular, compacted representation for SV. In the multi-layer aggregation module, we employ convolutional layers with shortcut connections among different layers to refine speaker characteristics derived from multi-layer representations from Whisper. In addition, an attention aggregation layer is used to reduce non-speaker interference and amplify speaker-specific cues for SV tasks. Finally, a simple classification module is used for speaker classification. Experiments on VoxCeleb1, FFSVC, and IMSV datasets demonstrate that Whisper-SV achieves EER/minDCF of 2.22%/0.307, 6.14%/0.488, and 7.50%/0.582, respectively, showing superior performance in low-data-resource SV scenarios.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"163 ","pages":"Article 103103"},"PeriodicalIF":2.4,"publicationDate":"2024-07-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141701112","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
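A rough sketch of the layer-selection and aggregation idea, using the Hugging Face Whisper implementation, is shown below: encoder hidden states from several layers are combined with fixed weights and pooled over time into an utterance-level embedding. The checkpoint name, the chosen layers, and the uniform weights are assumptions made for illustration; in the paper the representation-selection and multi-layer aggregation modules are learned.

```python
import torch
from transformers import WhisperFeatureExtractor, WhisperModel

fe = WhisperFeatureExtractor.from_pretrained("openai/whisper-base")
model = WhisperModel.from_pretrained("openai/whisper-base").eval()

wav = torch.randn(16000 * 3)                               # 3 s of dummy 16 kHz audio
feats = fe(wav.numpy(), sampling_rate=16000, return_tensors="pt").input_features

with torch.no_grad():
    enc = model.encoder(feats, output_hidden_states=True)

hidden = torch.stack(enc.hidden_states)                    # (layers+1, 1, frames, dim)
top_k = [-1, -2, -3]                                       # illustrative "top-k" layers: the last three
weights = torch.softmax(torch.ones(len(top_k)), dim=0)     # placeholder for learned aggregation weights
fused = sum(w * hidden[i] for w, i in zip(weights, top_k)) # weighted multi-layer fusion
embedding = fused.mean(dim=1).squeeze(0)                   # temporal pooling -> utterance embedding
print(embedding.shape)
```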
Advancing speaker embedding learning: Wespeaker toolkit for research and production
IF 2.4 · CAS Zone 3 · Computer Science
Speech Communication Pub Date: 2024-07-01 DOI: 10.1016/j.specom.2024.103104
Shuai Wang, Zhengyang Chen, Bing Han, Hongji Wang, Chengdong Liang, Binbin Zhang, Xu Xiang, Wen Ding, Johan Rohdin, Anna Silnova, Yanmin Qian, Haizhou Li
{"title":"Advancing speaker embedding learning: Wespeaker toolkit for research and production","authors":"Shuai Wang ,&nbsp;Zhengyang Chen ,&nbsp;Bing Han ,&nbsp;Hongji Wang ,&nbsp;Chengdong Liang ,&nbsp;Binbin Zhang ,&nbsp;Xu Xiang ,&nbsp;Wen Ding ,&nbsp;Johan Rohdin ,&nbsp;Anna Silnova ,&nbsp;Yanmin Qian ,&nbsp;Haizhou Li","doi":"10.1016/j.specom.2024.103104","DOIUrl":"10.1016/j.specom.2024.103104","url":null,"abstract":"<div><p>Speaker modeling plays a crucial role in various tasks, and fixed-dimensional vector representations, known as speaker embeddings, are the predominant modeling approach. These embeddings are typically evaluated within the framework of speaker verification, yet their utility extends to a broad scope of related tasks including speaker diarization, speech synthesis, voice conversion, and target speaker extraction. This paper presents Wespeaker, a user-friendly toolkit designed for both research and production purposes, dedicated to the learning of speaker embeddings. Wespeaker offers scalable data management, state-of-the-art speaker embedding models, and self-supervised learning training schemes with the potential to leverage large-scale unlabeled real-world data. The toolkit incorporates structured recipes that have been successfully adopted in winning systems across various speaker verification challenges, ensuring highly competitive results. For production-oriented development, Wespeaker integrates CPU- and GPU-compatible deployment and runtime codes, supporting mainstream platforms such as Windows, Linux, Mac and on-device chips such as horizon X3’PI. Wespeaker also provides off-the-shelf high-quality speaker embeddings by providing various pretrained models, which can be effortlessly applied to different tasks that require speaker modeling. The toolkit is publicly available at <span><span>https://github.com/wenet-e2e/wespeaker</span><svg><path></path></svg></span>.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"162 ","pages":"Article 103104"},"PeriodicalIF":2.4,"publicationDate":"2024-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141688867","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
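As an illustration of the off-the-shelf usage the abstract mentions, the snippet below assumes the pip-installable wespeaker Python package exposes a load_model / extract_embedding / compute_similarity interface as described in the project README; consult https://github.com/wenet-e2e/wespeaker for the current API and available pretrained models. File names are placeholders.

```python
# Hypothetical usage sketch based on the README of the Wespeaker repository;
# verify the exact function names against the installed package version.
import wespeaker

model = wespeaker.load_model("english")                      # fetches a pretrained embedding model
emb = model.extract_embedding("spk1_utt1.wav")               # fixed-dimensional speaker embedding
score = model.compute_similarity("spk1_utt1.wav", "spk1_utt2.wav")
print(score)                                                 # threshold the score to accept/reject a trial
```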
The effects of informational and energetic/modulation masking on the efficiency and ease of speech communication across the lifespan
IF 2.4 · CAS Zone 3 · Computer Science
Speech Communication Pub Date: 2024-07-01 DOI: 10.1016/j.specom.2024.103101
Outi Tuomainen, Stuart Rosen, Linda Taschenberger, Valerie Hazan
{"title":"The effects of informational and energetic/modulation masking on the efficiency and ease of speech communication across the lifespan","authors":"Outi Tuomainen ,&nbsp;Stuart Rosen ,&nbsp;Linda Taschenberger ,&nbsp;Valerie Hazan","doi":"10.1016/j.specom.2024.103101","DOIUrl":"10.1016/j.specom.2024.103101","url":null,"abstract":"<div><p>Children and older adults have greater difficulty understanding speech when there are other voices in the background (informational masking, IM) than when the interference is a steady-state noise with a similar spectral profile but is not speech (due to modulation and energetic masking; EM/MM). We evaluated whether this IM vs. EM/MM difference for certain age ranges was found for broader measures of communication efficiency and ease in 114 participants aged between 8 and 80. Participants carried out interactive <em>diapix</em> problem-solving tasks in age-band- and sex-matched pairs, in quiet and with different maskers in the background affecting both participants. Three measures were taken: (a) task transaction time (communication efficiency), (b) performance on a secondary auditory task simultaneously carried out during <em>diapix</em>, and (c) post-test subjective ratings of effort, concentration, difficulty and noisiness (communication ease). Although participants did not take longer to complete the task when in challenging conditions, effects of IM vs. EM/MM were clearly seen on the other measures. Relative to the EM/MM and quiet conditions, participants in IM conditions were less able to attend to the secondary task and reported greater effects of the masker type on their perceived degree of effort, concentration, difficulty and noisiness. However, we found no evidence of decreased communication efficiency and ease in IM relative to EM/MM for children and older adults in any of our measures. The clearest effects of age were observed in transaction time and secondary task measures. Overall, communication efficiency gradually improved between the ages 8–18 years and performance on the secondary task improved over younger ages (until 30 years) and gradually decreased after 50 years of age. Finally, we also found an impact of communicative role on performance. In adults, the participant asked to take the lead in the task and who spoke the most, performed worse on the secondary task than the person who was mainly in a ‘listening’ role and responding to queries. These results suggest that when a broader evaluation of speech communication is carried out that more closely resembles typical communicative situations, the more acute effects of IM typically seen in populations at the extremes of the lifespan are minimised potentially due to the presence of multiple information sources, which allow the use of varying communication strategies. 
Such a finding is relevant for clinical evaluations of speech communication.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"162 ","pages":"Article 103101"},"PeriodicalIF":2.4,"publicationDate":"2024-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0167639324000736/pdfft?md5=3bae57a7e48911c3d00f77555ed9d386&pid=1-s2.0-S0167639324000736-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141577736","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Pathological voice classification using MEEL features and SVM-TabNet model
IF 2.4 · CAS Zone 3 · Computer Science
Speech Communication Pub Date: 2024-07-01 DOI: 10.1016/j.specom.2024.103100
Mohammed Zakariah, Muna Al-Razgan, Taha Alfakih
{"title":"Pathological voice classification using MEEL features and SVM-TabNet model","authors":"Mohammed Zakariah ,&nbsp;Muna Al-Razgan ,&nbsp;Taha Alfakih","doi":"10.1016/j.specom.2024.103100","DOIUrl":"10.1016/j.specom.2024.103100","url":null,"abstract":"<div><p>In clinical settings, early diagnosis and objective assessment depend on the detection of voice pathology. To classify anomalous voices, this work uses an approach that combines the SVM-TabNet fusion model with MEEL (Mel-Frequency Energy Line) features. Further, the dataset consists of 1037 speech files, including recordings from people with laryngocele and Vox senilis as well as from healthy persons. Additionally, the main goal is to create an efficient classification model that can differentiate between normal and abnormal voice patterns. Modern techniques frequently lack the accuracy required for a precise diagnosis, which highlights the need for novel strategies. The suggested approach uses an SVM-TabNet fusion model for classification after feature extraction using MEEL characteristics. MEEL features provide extensive information for categorization by capturing complex patterns in audio transmissions. Moreover, by combining the advantages of SVM and TabNet models, classification performance is improved. Moreover, testing the model on test data yields remarkable results: 99.7 % accuracy, 0.992 F1 score, 0.996 precision, and 0.995 recall. Additional testing on additional datasets reliably validates outstanding performance, with 99.4 % accuracy, 0.99 F1 score, 0.998 precision, and 0.989 % recall. Furthermore, using the Saarbruecken Voice Database (SVD), the suggested methodology achieves an impressive accuracy of 99.97 %, demonstrating its durability and generalizability across many datasets. Overall, this work shows how the SVM-TabNet fusion model with MEEL characteristics may be used to accurately and consistently classify diseased voices, providing encouraging opportunities for clinical diagnosis and therapy tracking.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"162 ","pages":"Article 103100"},"PeriodicalIF":2.4,"publicationDate":"2024-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141571774","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
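A simplified sketch of this kind of pipeline is given below: utterance-level mel-band log-energy statistics (a rough stand-in for the paper's MEEL features) feed a plain SVM classifier. The file names and labels are hypothetical and the TabNet fusion stage is omitted, so this is an outline of the general approach rather than the authors' SVM-TabNet model.

```python
import numpy as np
import librosa
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def mel_energy_features(path: str, n_mels: int = 40) -> np.ndarray:
    """Utterance-level mel-band log-energy statistics (approximation of MEEL-style features)."""
    y, sr = librosa.load(path, sr=16000)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    logmel = librosa.power_to_db(mel)                       # (n_mels, frames)
    return np.concatenate([logmel.mean(axis=1), logmel.std(axis=1)])

# Hypothetical file lists; labels: 0 = healthy, 1 = pathological.
train_files, train_labels = ["healthy_01.wav", "pathological_01.wav"], [0, 1]
X = np.stack([mel_energy_features(f) for f in train_files])

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
clf.fit(X, train_labels)
print(clf.predict(np.stack([mel_energy_features("test_utt.wav")])))
```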
Analyzing the influence of different speech data corpora and speech features on speech emotion recognition: A review
IF 2.4 · CAS Zone 3 · Computer Science
Speech Communication Pub Date: 2024-07-01 DOI: 10.1016/j.specom.2024.103102
Tarun Rathi, Manoj Tripathy
{"title":"Analyzing the influence of different speech data corpora and speech features on speech emotion recognition: A review","authors":"Tarun Rathi,&nbsp;Manoj Tripathy","doi":"10.1016/j.specom.2024.103102","DOIUrl":"10.1016/j.specom.2024.103102","url":null,"abstract":"<div><p>Emotion recognition from speech has become crucial in human-computer interaction and affective computing applications. This review paper examines the complex relationship between two critical factors: the selection of speech data corpora and the extraction of speech features regarding speech emotion classification accuracy. Through an extensive analysis of literature from 2014 to 2023, publicly available speech datasets are explored and categorized based on their diversity, scale, linguistic attributes, and emotional classifications. The importance of various speech features, from basic spectral features to sophisticated prosodic cues, and their influence on emotion recognition accuracy is analyzed.. In the context of speech data corpora, this review paper unveils trends and insights from comparative studies exploring the repercussions of dataset choice on recognition efficacy. Various datasets such as IEMOCAP, EMODB, and MSP-IMPROV are scrutinized in terms of their influence on classifying the accuracy of the speech emotion recognition (SER) system. At the same time, potential challenges associated with dataset limitations are also examined. Notable features like Mel-frequency cepstral coefficients, pitch, intensity, and prosodic patterns are evaluated for their contributions to emotion recognition. Advanced feature extraction methods, too, are explored for their potential to capture intricate emotional dynamics. Moreover, this review paper offers insights into the methodological aspects of emotion recognition, shedding light on the diverse machine learning and deep learning approaches employed. Through a holistic synthesis of research findings, this review paper observes connections between the choice of speech data corpus, selection of speech features, and resulting emotion recognition accuracy. As the field continues to evolve, avenues for future research are proposed, ranging from enhanced feature extraction techniques to the development of standardized benchmark datasets. In essence, this review serves as a compass guiding researchers and practitioners through the intricate landscape of speech emotion recognition, offering a nuanced understanding of the factors shaping its recognition accuracy of speech emotion.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"162 ","pages":"Article 103102"},"PeriodicalIF":2.4,"publicationDate":"2024-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141637049","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
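For readers who want to experiment with the basic descriptors the review discusses (MFCCs, pitch, intensity), a small librosa-based extraction sketch is shown below; the file name is a placeholder, and mean/std pooling is just one simple way to obtain an utterance-level vector.

```python
import numpy as np
import librosa

# Extract basic acoustic descriptors for one utterance (hypothetical file name).
y, sr = librosa.load("utterance.wav", sr=16000)

mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)          # (13, frames) spectral envelope features
f0 = librosa.yin(y, fmin=60, fmax=400, sr=sr)               # frame-wise pitch estimate (Hz)
rms = librosa.feature.rms(y=y)[0]                           # frame-wise intensity proxy

# Simple utterance-level vector: mean/std pooling of each feature stream.
feat = np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1),
                       [np.nanmean(f0), np.nanstd(f0)],
                       [rms.mean(), rms.std()]])
print(feat.shape)                                           # (30,)
```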
Emotions recognition in audio signals using an extension of the latent block model
IF 3.2 · CAS Zone 3 · Computer Science
Speech Communication Pub Date: 2024-06-01 DOI: 10.1016/j.specom.2024.103092
Abir El Haj
{"title":"Emotions recognition in audio signals using an extension of the latent block model","authors":"Abir El Haj","doi":"10.1016/j.specom.2024.103092","DOIUrl":"10.1016/j.specom.2024.103092","url":null,"abstract":"<div><p>Emotion detection in human speech is a significant area of research, crucial for various applications such as affective computing and human–computer interaction. Despite advancements, accurately categorizing emotional states in speech remains challenging due to its subjective nature and the complexity of human emotions. To address this, we propose leveraging Mel frequency cepstral coefficients (MFCCS) and extend the latent block model (LBM) probabilistic clustering technique with a Gaussian multi-way latent block model (GMWLBM). Our objective is to categorize speech emotions into coherent groups based on the emotional states conveyed by speakers. We employ MFCCS from time-series audio data and utilize a variational Expectation Maximization method to estimate GMWLBM parameters. Additionally, we introduce an integrated Classification Likelihood (ICL) model selection criterion to determine the optimal number of clusters, enhancing robustness. Numerical experiments on real data from the Berlin Database of Emotional Speech (EMO-DB) demonstrate our method’s efficacy in accurately detecting and classifying emotional states in human speech, even in challenging real-world scenarios, thereby contributing significantly to affective computing and human–computer interaction applications.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"161 ","pages":"Article 103092"},"PeriodicalIF":3.2,"publicationDate":"2024-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141278454","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
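The overall recipe, MFCC-based utterance features clustered into emotion groups with the number of clusters chosen by a penalized-likelihood criterion, can be approximated with a much simpler stand-in: a Gaussian mixture model selected by BIC instead of the paper's Gaussian multi-way latent block model with the ICL criterion. File names are hypothetical.

```python
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def utterance_mfcc(path: str) -> np.ndarray:
    """Mean-pooled MFCC vector for one utterance."""
    y, sr = librosa.load(path, sr=16000)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).mean(axis=1)

files = ["a1.wav", "a2.wav", "a3.wav", "a4.wav", "a5.wav", "a6.wav"]  # placeholder utterances
X = np.stack([utterance_mfcc(f) for f in files])

# Choose the number of emotion clusters with a penalized-likelihood criterion (BIC here,
# standing in for ICL) and read off the cluster assignments.
models = {k: GaussianMixture(n_components=k, random_state=0).fit(X) for k in (2, 3)}
best_k = min(models, key=lambda k: models[k].bic(X))
print(best_k, models[best_k].predict(X))
```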
Summary of the DISPLACE challenge 2023-DIarization of SPeaker and LAnguage in Conversational Environments
IF 3.2 · CAS Zone 3 · Computer Science
Speech Communication Pub Date: 2024-05-11 DOI: 10.1016/j.specom.2024.103080
Shikha Baghel, Shreyas Ramoji, Somil Jain, Pratik Roy Chowdhuri, Prachi Singh, Deepu Vijayasenan, Sriram Ganapathy
{"title":"Summary of the DISPLACE challenge 2023-DIarization of SPeaker and LAnguage in Conversational Environments","authors":"Shikha Baghel ,&nbsp;Shreyas Ramoji ,&nbsp;Somil Jain ,&nbsp;Pratik Roy Chowdhuri ,&nbsp;Prachi Singh ,&nbsp;Deepu Vijayasenan ,&nbsp;Sriram Ganapathy","doi":"10.1016/j.specom.2024.103080","DOIUrl":"10.1016/j.specom.2024.103080","url":null,"abstract":"<div><p>In multi-lingual societies, where multiple languages are spoken in a small geographic vicinity, informal conversations often involve mix of languages. Existing speech technologies may be inefficient in extracting information from such conversations, where the speech data is rich in diversity with multiple languages and speakers. The <strong>DISPLACE</strong> (DIarization of SPeaker and LAnguage in Conversational Environments) challenge constitutes an open-call for evaluating and bench-marking the speaker and language diarization technologies on this challenging condition. To facilitate this challenge, a real-world dataset featuring multilingual, multi-speaker conversational far-field speech was recorded and distributed. The challenge entailed two tracks: Track-1 focused on speaker diarization (SD) in multilingual situations while, Track-2 addressed the language diarization (LD) in a multi-speaker scenario. Both the tracks were evaluated using the same underlying audio data. Furthermore, a baseline system was made available for both SD and LD task which mimicked the state-of-art in these tasks. The challenge garnered a total of 42 world-wide registrations and received a total of 19 combined submissions for Track-1 and Track-2. This paper describes the challenge, details of the datasets, tasks, and the baseline system. Additionally, the paper provides a concise overview of the submitted systems in both tracks, with an emphasis given to the top performing systems. The paper also presents insights and future perspectives for SD and LD tasks, focusing on the key challenges that the systems need to overcome before wide-spread commercial deployment on such conversations.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"161 ","pages":"Article 103080"},"PeriodicalIF":3.2,"publicationDate":"2024-05-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141054826","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0