{"title":"Pathological voice classification using MEEL features and SVM-TabNet model","authors":"Mohammed Zakariah , Muna Al-Razgan , Taha Alfakih","doi":"10.1016/j.specom.2024.103100","DOIUrl":"10.1016/j.specom.2024.103100","url":null,"abstract":"<div><p>In clinical settings, early diagnosis and objective assessment depend on the detection of voice pathology. To classify anomalous voices, this work uses an approach that combines the SVM-TabNet fusion model with MEEL (Mel-Frequency Energy Line) features. Further, the dataset consists of 1037 speech files, including recordings from people with laryngocele and Vox senilis as well as from healthy persons. Additionally, the main goal is to create an efficient classification model that can differentiate between normal and abnormal voice patterns. Modern techniques frequently lack the accuracy required for a precise diagnosis, which highlights the need for novel strategies. The suggested approach uses an SVM-TabNet fusion model for classification after feature extraction using MEEL characteristics. MEEL features provide extensive information for categorization by capturing complex patterns in audio transmissions. Moreover, by combining the advantages of SVM and TabNet models, classification performance is improved. Moreover, testing the model on test data yields remarkable results: 99.7 % accuracy, 0.992 F1 score, 0.996 precision, and 0.995 recall. Additional testing on additional datasets reliably validates outstanding performance, with 99.4 % accuracy, 0.99 F1 score, 0.998 precision, and 0.989 % recall. Furthermore, using the Saarbruecken Voice Database (SVD), the suggested methodology achieves an impressive accuracy of 99.97 %, demonstrating its durability and generalizability across many datasets. Overall, this work shows how the SVM-TabNet fusion model with MEEL characteristics may be used to accurately and consistently classify diseased voices, providing encouraging opportunities for clinical diagnosis and therapy tracking.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"162 ","pages":"Article 103100"},"PeriodicalIF":2.4,"publicationDate":"2024-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141571774","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Analyzing the influence of different speech data corpora and speech features on speech emotion recognition: A review","authors":"Tarun Rathi, Manoj Tripathy","doi":"10.1016/j.specom.2024.103102","DOIUrl":"10.1016/j.specom.2024.103102","url":null,"abstract":"<div><p>Emotion recognition from speech has become crucial in human-computer interaction and affective computing applications. This review paper examines the complex relationship between two critical factors: the selection of speech data corpora and the extraction of speech features regarding speech emotion classification accuracy. Through an extensive analysis of literature from 2014 to 2023, publicly available speech datasets are explored and categorized based on their diversity, scale, linguistic attributes, and emotional classifications. The importance of various speech features, from basic spectral features to sophisticated prosodic cues, and their influence on emotion recognition accuracy is analyzed.. In the context of speech data corpora, this review paper unveils trends and insights from comparative studies exploring the repercussions of dataset choice on recognition efficacy. Various datasets such as IEMOCAP, EMODB, and MSP-IMPROV are scrutinized in terms of their influence on classifying the accuracy of the speech emotion recognition (SER) system. At the same time, potential challenges associated with dataset limitations are also examined. Notable features like Mel-frequency cepstral coefficients, pitch, intensity, and prosodic patterns are evaluated for their contributions to emotion recognition. Advanced feature extraction methods, too, are explored for their potential to capture intricate emotional dynamics. Moreover, this review paper offers insights into the methodological aspects of emotion recognition, shedding light on the diverse machine learning and deep learning approaches employed. Through a holistic synthesis of research findings, this review paper observes connections between the choice of speech data corpus, selection of speech features, and resulting emotion recognition accuracy. As the field continues to evolve, avenues for future research are proposed, ranging from enhanced feature extraction techniques to the development of standardized benchmark datasets. In essence, this review serves as a compass guiding researchers and practitioners through the intricate landscape of speech emotion recognition, offering a nuanced understanding of the factors shaping its recognition accuracy of speech emotion.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"162 ","pages":"Article 103102"},"PeriodicalIF":2.4,"publicationDate":"2024-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141637049","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Emotions recognition in audio signals using an extension of the latent block model","authors":"Abir El Haj","doi":"10.1016/j.specom.2024.103092","DOIUrl":"10.1016/j.specom.2024.103092","url":null,"abstract":"<div><p>Emotion detection in human speech is a significant area of research, crucial for various applications such as affective computing and human–computer interaction. Despite advancements, accurately categorizing emotional states in speech remains challenging due to its subjective nature and the complexity of human emotions. To address this, we propose leveraging Mel frequency cepstral coefficients (MFCCS) and extend the latent block model (LBM) probabilistic clustering technique with a Gaussian multi-way latent block model (GMWLBM). Our objective is to categorize speech emotions into coherent groups based on the emotional states conveyed by speakers. We employ MFCCS from time-series audio data and utilize a variational Expectation Maximization method to estimate GMWLBM parameters. Additionally, we introduce an integrated Classification Likelihood (ICL) model selection criterion to determine the optimal number of clusters, enhancing robustness. Numerical experiments on real data from the Berlin Database of Emotional Speech (EMO-DB) demonstrate our method’s efficacy in accurately detecting and classifying emotional states in human speech, even in challenging real-world scenarios, thereby contributing significantly to affective computing and human–computer interaction applications.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"161 ","pages":"Article 103092"},"PeriodicalIF":3.2,"publicationDate":"2024-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141278454","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Summary of the DISPLACE challenge 2023-DIarization of SPeaker and LAnguage in Conversational Environments","authors":"Shikha Baghel , Shreyas Ramoji , Somil Jain , Pratik Roy Chowdhuri , Prachi Singh , Deepu Vijayasenan , Sriram Ganapathy","doi":"10.1016/j.specom.2024.103080","DOIUrl":"10.1016/j.specom.2024.103080","url":null,"abstract":"<div><p>In multi-lingual societies, where multiple languages are spoken in a small geographic vicinity, informal conversations often involve mix of languages. Existing speech technologies may be inefficient in extracting information from such conversations, where the speech data is rich in diversity with multiple languages and speakers. The <strong>DISPLACE</strong> (DIarization of SPeaker and LAnguage in Conversational Environments) challenge constitutes an open-call for evaluating and bench-marking the speaker and language diarization technologies on this challenging condition. To facilitate this challenge, a real-world dataset featuring multilingual, multi-speaker conversational far-field speech was recorded and distributed. The challenge entailed two tracks: Track-1 focused on speaker diarization (SD) in multilingual situations while, Track-2 addressed the language diarization (LD) in a multi-speaker scenario. Both the tracks were evaluated using the same underlying audio data. Furthermore, a baseline system was made available for both SD and LD task which mimicked the state-of-art in these tasks. The challenge garnered a total of 42 world-wide registrations and received a total of 19 combined submissions for Track-1 and Track-2. This paper describes the challenge, details of the datasets, tasks, and the baseline system. Additionally, the paper provides a concise overview of the submitted systems in both tracks, with an emphasis given to the top performing systems. The paper also presents insights and future perspectives for SD and LD tasks, focusing on the key challenges that the systems need to overcome before wide-spread commercial deployment on such conversations.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"161 ","pages":"Article 103080"},"PeriodicalIF":3.2,"publicationDate":"2024-05-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141054826","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"End-to-end integration of speech separation and voice activity detection for low-latency diarization of telephone conversations","authors":"Giovanni Morrone , Samuele Cornell , Luca Serafini , Enrico Zovato , Alessio Brutti , Stefano Squartini","doi":"10.1016/j.specom.2024.103081","DOIUrl":"https://doi.org/10.1016/j.specom.2024.103081","url":null,"abstract":"<div><p>Recent works show that speech separation guided diarization (SSGD) is an increasingly promising direction, mainly thanks to the recent progress in speech separation. It performs diarization by first separating the speakers and then applying voice activity detection (VAD) on each separated stream. In this work we conduct an in-depth study of SSGD in the conversational telephone speech (CTS) domain, focusing mainly on low-latency streaming diarization applications. We consider three state-of-the-art speech separation (SSep) algorithms and study their performance both in online and offline scenarios, considering non-causal and causal implementations as well as continuous SSep (CSS) windowed inference. We compare different SSGD algorithms on two widely used CTS datasets: CALLHOME and Fisher Corpus (Part 1 and 2) and evaluate both separation and diarization performance. To improve performance, a novel, causal and computationally efficient leakage removal algorithm is proposed, which significantly decreases false alarms. We also explore, for the first time, fully end-to-end SSGD integration between SSep and VAD modules. Crucially, this enables fine-tuning on real-world data for which oracle speakers sources are not available. In particular, our best model achieves 8.8% DER on CALLHOME, which outperforms the current state-of-the-art end-to-end neural diarization model, despite being trained on an order of magnitude less data and having significantly lower latency, i.e., 0.1 vs. 1 s. Finally, we also show that the separated signals can be readily used also for automatic speech recognition, reaching performance close to using oracle sources in some configurations.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"161 ","pages":"Article 103081"},"PeriodicalIF":3.2,"publicationDate":"2024-05-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141078094","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Visual-articulatory cues facilitate children with CIs to better perceive Mandarin tones in sentences","authors":"Ping Tang, Shanpeng Li, Yanan Shen, Qianxi Yu, Yan Feng","doi":"10.1016/j.specom.2024.103084","DOIUrl":"10.1016/j.specom.2024.103084","url":null,"abstract":"<div><p>Children with cochlear implants (CIs) face challenges in tonal perception under noise. Nevertheless, our previous research demonstrated that seeing visual-articulatory cues (speakers’ facial/head movements) benefited these children to perceive isolated tones better, particularly in noisy environments, with those implanted earlier gaining more benefits. However, tones in daily speech typically occur in sentence contexts where visual cues are largely reduced compared to those in isolated contexts. It was thus unclear if visual benefits on tonal perception still hold in these challenging sentence contexts. Therefore, this study tested 64 children with CIs and 64 age-matched NH children. Target tones in sentence-medial position were presented in audio-only (AO) or audiovisual (AV) conditions, in quiet and noisy environments. Children selected the target tone using a picture-point task. The results showed that, while NH children did not show any perception difference between AO and AV conditions, children with CIs significantly improved their perceptual accuracy from AO to AV conditions. The degree of improvement was negatively correlated with their implantation ages. Therefore, children with CIs were able to use visual-articulatory cues to facilitate their tonal perception even in sentence contexts, and earlier auditory experience might be important in shaping this ability.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"160 ","pages":"Article 103084"},"PeriodicalIF":3.2,"publicationDate":"2024-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141028923","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The prosody of theme, rheme and focus in Egyptian Arabic: A quantitative investigation of tunes, configurations and speaker variability","authors":"Dina El Zarka , Anneliese Kelterer , Michele Gubian , Barbara Schuppler","doi":"10.1016/j.specom.2024.103082","DOIUrl":"10.1016/j.specom.2024.103082","url":null,"abstract":"<div><p>This paper investigates the prosody of sentences elicited in three Information Structure (IS) conditions: all-new, theme-rheme and rhematic focus-background. The sentences were produced by 18 speakers of Egyptian Arabic (EA). This is the first quantitative study to provide a comprehensive analysis of holistic f0 contours (by means of GAMM) and configurations of f0, duration and intensity (by means of FPCA) associated with the three IS conditions, both across and within speakers. A significant difference between focus-background and the other information structure conditions was found, but also strong inter-speaker variation in terms of strategies and the degree to which these strategies were applied. The results suggest that post-focus register lowering and the duration of the stressed syllables of the focused and the utterance-final word are more consistent cues to focus than a higher peak of the focus accent. In addition, some independence of duration and intensity from f0 could be identified. These results thus support the assumption that, when focus is marked prosodically in EA, it is marked by prominence. Nevertheless, the fact that a considerable number of EA speakers did not apply prosodic marking and the fact that prosodic focus marking was gradient rather than categorical suggest that EA does not have a fully conventionalized prosodic focus construction.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"160 ","pages":"Article 103082"},"PeriodicalIF":3.2,"publicationDate":"2024-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0167639324000542/pdfft?md5=dcb4ae8365c4f0e84a5827d3ae202551&pid=1-s2.0-S0167639324000542-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141035839","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Factorized and progressive knowledge distillation for CTC-based ASR models","authors":"Sanli Tian , Zehan Li , Zhaobiao Lyv , Gaofeng Cheng , Qing Xiao , Ta Li , Qingwei Zhao","doi":"10.1016/j.specom.2024.103071","DOIUrl":"https://doi.org/10.1016/j.specom.2024.103071","url":null,"abstract":"<div><p>Knowledge distillation (KD) is a popular model compression method to improve the performance of lightweight models by transferring knowledge from a teacher model to a student model. However, applying KD to connectionist temporal classification (CTC) ASR model is challenging due to its peaky posterior property. In this paper, we propose to address this issue by treating non-blank and blank frames differently for two main reasons. First, the non-blank frames in the teacher model’s posterior matrix and hidden representations provide more acoustic and linguistic information than the blank frames, but the frame number of non-blank frames only accounts for a small fraction of all frames, leading to a severe learning imbalance problem. Second, the non-blank tokens in the teacher’s blank-frame posteriors exhibit irregular probability distributions, negatively impacting the student model’s learning. Thus, we propose to factorize the distillation of non-blank and blank frames and further combine them into a progressive KD framework, which contains three incremental stages to facilitate the student model gradually building up its knowledge. The first stage involves a simple binary classification KD task, in which the student learns to distinguish between non-blank and blank frames, as the two types of frames are learned separately in subsequent stages. The second stage is a factorized representation-based KD, in which hidden representations are divided into non-blank and blank frames so that both can be distilled in a balanced manner. In the third stage, the student learns from the teacher’s posterior matrix through our proposed method, factorized KL-divergence (FKL), which performs different operation on blank and non-blank frame posteriors to alleviate the imbalance issue and reduce the influence of irregular probability distributions. Compared to the baseline, our proposed method achieves 22.5% relative CER reduction on the Aishell-1 dataset, 23.0% relative WER reduction on the Tedlium-2 dataset, and 17.6% relative WER reduction on the LibriSpeech dataset. To show the generalization of our method, we also evaluate our method on the hybrid CTC/Attention architecture as well as on scenarios with cross-model topology KD.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"160 ","pages":"Article 103071"},"PeriodicalIF":3.2,"publicationDate":"2024-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140879835","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Optimization-based planning of speech articulation using general Tau Theory","authors":"Benjamin Elie , Juraj Šimko , Alice Turk","doi":"10.1016/j.specom.2024.103083","DOIUrl":"https://doi.org/10.1016/j.specom.2024.103083","url":null,"abstract":"<div><p>This paper presents a model of speech articulation planning and generation based on General Tau Theory and Optimal Control Theory. Because General Tau Theory assumes that articulatory targets are always reached, the model accounts for speech variation via context-dependent articulatory targets. Targets are chosen via the optimization of a composite objective function. This function models three different task requirements: maximal intelligibility, minimal articulatory effort and minimal utterance duration. The paper shows that systematic phonetic variability can be reproduced by adjusting the weights assigned to each task requirement. Weights can be adjusted globally to simulate different speech styles, and can be adjusted locally to simulate different levels of prosodic prominence. The solution of the optimization procedure contains Tau equation parameter values for each articulatory movement, namely position of the articulator at the movement offset, movement duration, and a parameter which relates to the shape of the movement’s velocity profile. The paper presents simulations which illustrate the ability of the model to predict or reproduce several well-known characteristics of speech. These phenomena include close-to-symmetric velocity profiles for articulatory movement, variation related to speech rate, centralization of unstressed vowels, lengthening of stressed vowels, lenition of unstressed lingual stop consonants, and coarticulation of stop consonants.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"160 ","pages":"Article 103083"},"PeriodicalIF":3.2,"publicationDate":"2024-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0167639324000554/pdfft?md5=9244f2762d9cdb76bf74cf04a57a092e&pid=1-s2.0-S0167639324000554-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140948784","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Chinese speech intelligibility and speech intelligibility index for the elderly","authors":"Jiazhong Zeng , Jianxin Peng , Shuyin Xiang","doi":"10.1016/j.specom.2024.103072","DOIUrl":"https://doi.org/10.1016/j.specom.2024.103072","url":null,"abstract":"<div><p>The speech intelligibility index (SII) and speech transmission index (STI) are widely accepted objective metrics for assessing speech intelligibility. In previous work, the relationship between STI and Chinese speech intelligibility (CSI) scores was studied. In this paper, the relationship between SII and CSI scores in rooms for the elderly aged 60–69 and over 70 is investigated by using auralization method under different background noise levels (40dBA and 55dBA) and different reverberation times. The results show that SII has good correlation with CSI score of the elderly. To get the same CSI score as the young adults, the elderly need a larger SII value, and the value increases with the increase of the age for the elderly. Since hearing loss of the elderly is considered in the calculation of SII, the difference in the required SII between the elderly and young is less than that of the required STI under the same CSI score condition. This indicates that SII is a more consistent evaluation criterion for different ages.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"160 ","pages":"Article 103072"},"PeriodicalIF":3.2,"publicationDate":"2024-04-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140638630","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}