{"title":"End-to-end integration of speech separation and voice activity detection for low-latency diarization of telephone conversations","authors":"Giovanni Morrone , Samuele Cornell , Luca Serafini , Enrico Zovato , Alessio Brutti , Stefano Squartini","doi":"10.1016/j.specom.2024.103081","DOIUrl":"https://doi.org/10.1016/j.specom.2024.103081","url":null,"abstract":"<div><p>Recent works show that speech separation guided diarization (SSGD) is an increasingly promising direction, mainly thanks to the recent progress in speech separation. It performs diarization by first separating the speakers and then applying voice activity detection (VAD) on each separated stream. In this work we conduct an in-depth study of SSGD in the conversational telephone speech (CTS) domain, focusing mainly on low-latency streaming diarization applications. We consider three state-of-the-art speech separation (SSep) algorithms and study their performance both in online and offline scenarios, considering non-causal and causal implementations as well as continuous SSep (CSS) windowed inference. We compare different SSGD algorithms on two widely used CTS datasets: CALLHOME and Fisher Corpus (Part 1 and 2) and evaluate both separation and diarization performance. To improve performance, a novel, causal and computationally efficient leakage removal algorithm is proposed, which significantly decreases false alarms. We also explore, for the first time, fully end-to-end SSGD integration between SSep and VAD modules. Crucially, this enables fine-tuning on real-world data for which oracle speakers sources are not available. In particular, our best model achieves 8.8% DER on CALLHOME, which outperforms the current state-of-the-art end-to-end neural diarization model, despite being trained on an order of magnitude less data and having significantly lower latency, i.e., 0.1 vs. 1 s. 
Finally, we also show that the separated signals can be readily used also for automatic speech recognition, reaching performance close to using oracle sources in some configurations.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"161 ","pages":"Article 103081"},"PeriodicalIF":3.2,"publicationDate":"2024-05-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141078094","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
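The separate-then-VAD structure of SSGD described in this abstract can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the neural separator is stubbed out (the paper uses trained SSep models), and the VAD here is a simple energy threshold rather than a learned module, with hypothetical frame-length and threshold values.

```python
# Sketch of the speech-separation-guided diarization (SSGD) pipeline:
# separate speakers, run VAD on each separated stream, and emit
# per-speaker speech segments. A real system would replace energy_vad
# with a neural VAD and feed streams from a neural separator.

def energy_vad(stream, frame_len=160, threshold=0.01):
    """Per-frame speech/non-speech decisions for one separated stream."""
    decisions = []
    for start in range(0, len(stream) - frame_len + 1, frame_len):
        frame = stream[start:start + frame_len]
        energy = sum(s * s for s in frame) / frame_len
        decisions.append(energy > threshold)
    return decisions

def frames_to_segments(decisions, frame_len=160, rate=8000):
    """Merge consecutive speech frames into (onset, offset) pairs in seconds."""
    segments, onset = [], None
    for i, speech in enumerate(decisions):
        t = i * frame_len / rate
        if speech and onset is None:
            onset = t
        elif not speech and onset is not None:
            segments.append((onset, t))
            onset = None
    if onset is not None:
        segments.append((onset, len(decisions) * frame_len / rate))
    return segments

def ssgd(separated_streams, rate=8000):
    """Diarization output: one segment list per separated speaker."""
    return [frames_to_segments(energy_vad(s), rate=rate) for s in separated_streams]
```

Because each speaker gets a dedicated stream, overlapped speech needs no special handling downstream; the leakage-removal step proposed in the paper would sit between separation and VAD to suppress cross-channel residue that otherwise causes false alarms.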
{"title":"Visual-articulatory cues facilitate children with CIs to better perceive Mandarin tones in sentences","authors":"Ping Tang, Shanpeng Li, Yanan Shen, Qianxi Yu, Yan Feng","doi":"10.1016/j.specom.2024.103084","DOIUrl":"10.1016/j.specom.2024.103084","url":null,"abstract":"<div><p>Children with cochlear implants (CIs) face challenges in tonal perception under noise. Nevertheless, our previous research demonstrated that seeing visual-articulatory cues (speakers’ facial/head movements) benefited these children to perceive isolated tones better, particularly in noisy environments, with those implanted earlier gaining more benefits. However, tones in daily speech typically occur in sentence contexts where visual cues are largely reduced compared to those in isolated contexts. It was thus unclear if visual benefits on tonal perception still hold in these challenging sentence contexts. Therefore, this study tested 64 children with CIs and 64 age-matched NH children. Target tones in sentence-medial position were presented in audio-only (AO) or audiovisual (AV) conditions, in quiet and noisy environments. Children selected the target tone using a picture-point task. The results showed that, while NH children did not show any perception difference between AO and AV conditions, children with CIs significantly improved their perceptual accuracy from AO to AV conditions. The degree of improvement was negatively correlated with their implantation ages. 
Therefore, children with CIs were able to use visual-articulatory cues to facilitate their tonal perception even in sentence contexts, and earlier auditory experience might be important in shaping this ability.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"160 ","pages":"Article 103084"},"PeriodicalIF":3.2,"publicationDate":"2024-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141028923","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The prosody of theme, rheme and focus in Egyptian Arabic: A quantitative investigation of tunes, configurations and speaker variability","authors":"Dina El Zarka , Anneliese Kelterer , Michele Gubian , Barbara Schuppler","doi":"10.1016/j.specom.2024.103082","DOIUrl":"10.1016/j.specom.2024.103082","url":null,"abstract":"<div><p>This paper investigates the prosody of sentences elicited in three Information Structure (IS) conditions: all-new, theme-rheme and rhematic focus-background. The sentences were produced by 18 speakers of Egyptian Arabic (EA). This is the first quantitative study to provide a comprehensive analysis of holistic f0 contours (by means of GAMM) and configurations of f0, duration and intensity (by means of FPCA) associated with the three IS conditions, both across and within speakers. A significant difference between focus-background and the other information structure conditions was found, but also strong inter-speaker variation in terms of strategies and the degree to which these strategies were applied. The results suggest that post-focus register lowering and the duration of the stressed syllables of the focused and the utterance-final word are more consistent cues to focus than a higher peak of the focus accent. In addition, some independence of duration and intensity from f0 could be identified. These results thus support the assumption that, when focus is marked prosodically in EA, it is marked by prominence. 
Nevertheless, the fact that a considerable number of EA speakers did not apply prosodic marking and the fact that prosodic focus marking was gradient rather than categorical suggest that EA does not have a fully conventionalized prosodic focus construction.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"160 ","pages":"Article 103082"},"PeriodicalIF":3.2,"publicationDate":"2024-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0167639324000542/pdfft?md5=dcb4ae8365c4f0e84a5827d3ae202551&pid=1-s2.0-S0167639324000542-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141035839","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Factorized and progressive knowledge distillation for CTC-based ASR models","authors":"Sanli Tian , Zehan Li , Zhaobiao Lyv , Gaofeng Cheng , Qing Xiao , Ta Li , Qingwei Zhao","doi":"10.1016/j.specom.2024.103071","DOIUrl":"https://doi.org/10.1016/j.specom.2024.103071","url":null,"abstract":"<div><p>Knowledge distillation (KD) is a popular model compression method to improve the performance of lightweight models by transferring knowledge from a teacher model to a student model. However, applying KD to connectionist temporal classification (CTC) ASR model is challenging due to its peaky posterior property. In this paper, we propose to address this issue by treating non-blank and blank frames differently for two main reasons. First, the non-blank frames in the teacher model’s posterior matrix and hidden representations provide more acoustic and linguistic information than the blank frames, but the frame number of non-blank frames only accounts for a small fraction of all frames, leading to a severe learning imbalance problem. Second, the non-blank tokens in the teacher’s blank-frame posteriors exhibit irregular probability distributions, negatively impacting the student model’s learning. Thus, we propose to factorize the distillation of non-blank and blank frames and further combine them into a progressive KD framework, which contains three incremental stages to facilitate the student model gradually building up its knowledge. The first stage involves a simple binary classification KD task, in which the student learns to distinguish between non-blank and blank frames, as the two types of frames are learned separately in subsequent stages. The second stage is a factorized representation-based KD, in which hidden representations are divided into non-blank and blank frames so that both can be distilled in a balanced manner. 
In the third stage, the student learns from the teacher’s posterior matrix through our proposed method, factorized KL-divergence (FKL), which performs different operation on blank and non-blank frame posteriors to alleviate the imbalance issue and reduce the influence of irregular probability distributions. Compared to the baseline, our proposed method achieves 22.5% relative CER reduction on the Aishell-1 dataset, 23.0% relative WER reduction on the Tedlium-2 dataset, and 17.6% relative WER reduction on the LibriSpeech dataset. To show the generalization of our method, we also evaluate our method on the hybrid CTC/Attention architecture as well as on scenarios with cross-model topology KD.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"160 ","pages":"Article 103071"},"PeriodicalIF":3.2,"publicationDate":"2024-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140879835","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
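The core idea of distilling blank and non-blank frames separately can be sketched as follows. This is a simplified stand-in for the paper's FKL loss, not its actual formulation: frames are grouped by whether the teacher's top token is the blank symbol, each group's KL divergence is averaged independently, and the group weights (`w_blank`, `w_nonblank`) are hypothetical.

```python
import math

def kl_div(p, q, eps=1e-8):
    """KL(p || q) between two discrete distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def factorized_kd_loss(teacher, student, blank_id=0, w_blank=0.5, w_nonblank=1.0):
    """Frame-wise KD loss with blank and non-blank frames distilled separately.

    A frame counts as 'blank' when the teacher's top-scoring token is the
    blank symbol. Averaging the two groups independently keeps the many
    blank frames from swamping the few informative non-blank frames."""
    blank_losses, nonblank_losses = [], []
    for t_post, s_post in zip(teacher, student):
        is_blank = max(range(len(t_post)), key=t_post.__getitem__) == blank_id
        (blank_losses if is_blank else nonblank_losses).append(kl_div(t_post, s_post))
    loss = 0.0
    if blank_losses:
        loss += w_blank * sum(blank_losses) / len(blank_losses)
    if nonblank_losses:
        loss += w_nonblank * sum(nonblank_losses) / len(nonblank_losses)
    return loss
```

In a real training loop the posteriors would be softmax outputs over the vocabulary per frame, and this loss would be combined with the CTC objective.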
{"title":"Optimization-based planning of speech articulation using general Tau Theory","authors":"Benjamin Elie , Juraj Šimko , Alice Turk","doi":"10.1016/j.specom.2024.103083","DOIUrl":"https://doi.org/10.1016/j.specom.2024.103083","url":null,"abstract":"<div><p>This paper presents a model of speech articulation planning and generation based on General Tau Theory and Optimal Control Theory. Because General Tau Theory assumes that articulatory targets are always reached, the model accounts for speech variation via context-dependent articulatory targets. Targets are chosen via the optimization of a composite objective function. This function models three different task requirements: maximal intelligibility, minimal articulatory effort and minimal utterance duration. The paper shows that systematic phonetic variability can be reproduced by adjusting the weights assigned to each task requirement. Weights can be adjusted globally to simulate different speech styles, and can be adjusted locally to simulate different levels of prosodic prominence. The solution of the optimization procedure contains Tau equation parameter values for each articulatory movement, namely position of the articulator at the movement offset, movement duration, and a parameter which relates to the shape of the movement’s velocity profile. The paper presents simulations which illustrate the ability of the model to predict or reproduce several well-known characteristics of speech. 
These phenomena include close-to-symmetric velocity profiles for articulatory movement, variation related to speech rate, centralization of unstressed vowels, lengthening of stressed vowels, lenition of unstressed lingual stop consonants, and coarticulation of stop consonants.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"160 ","pages":"Article 103083"},"PeriodicalIF":3.2,"publicationDate":"2024-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0167639324000554/pdfft?md5=9244f2762d9cdb76bf74cf04a57a092e&pid=1-s2.0-S0167639324000554-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140948784","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
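A concrete sense of what a tau-guided movement looks like can be had from one closed form commonly used in the tau-guidance literature (not necessarily the exact parameterization of this paper): a gap x closed under tau-G guidance follows x(t) = x0 · (1 − (t/T)²)^(1/k), where T is the movement duration and k shapes the velocity profile. The gap always reaches zero at t = T, which matches the paper's assumption that targets are always reached.

```python
def tau_guided_gap(x0, T, k, n=5):
    """Sample the tau-G-guided gap closure x(t) = x0 * (1 - (t/T)**2)**(1/k)
    at n+1 evenly spaced times in [0, T]. x0: initial gap, T: movement
    duration, k: velocity-profile shape parameter."""
    times = [i * T / n for i in range(n + 1)]
    return [x0 * (1.0 - (t / T) ** 2) ** (1.0 / k) for t in times]
```

Whatever x0, T and k the optimizer picks, the trajectory starts at the full gap and ends exactly at the target, so variability lives entirely in the chosen parameters rather than in undershoot.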
{"title":"Chinese speech intelligibility and speech intelligibility index for the elderly","authors":"Jiazhong Zeng , Jianxin Peng , Shuyin Xiang","doi":"10.1016/j.specom.2024.103072","DOIUrl":"https://doi.org/10.1016/j.specom.2024.103072","url":null,"abstract":"<div><p>The speech intelligibility index (SII) and speech transmission index (STI) are widely accepted objective metrics for assessing speech intelligibility. In previous work, the relationship between STI and Chinese speech intelligibility (CSI) scores was studied. In this paper, the relationship between SII and CSI scores in rooms for the elderly aged 60–69 and over 70 is investigated by using auralization method under different background noise levels (40dBA and 55dBA) and different reverberation times. The results show that SII has good correlation with CSI score of the elderly. To get the same CSI score as the young adults, the elderly need a larger SII value, and the value increases with the increase of the age for the elderly. Since hearing loss of the elderly is considered in the calculation of SII, the difference in the required SII between the elderly and young is less than that of the required STI under the same CSI score condition. This indicates that SII is a more consistent evaluation criterion for different ages.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"160 ","pages":"Article 103072"},"PeriodicalIF":3.2,"publicationDate":"2024-04-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140638630","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Combined approach to dysarthric speaker verification using data augmentation and feature fusion","authors":"Shinimol Salim , Syed Shahnawazuddin , Waquar Ahmad","doi":"10.1016/j.specom.2024.103070","DOIUrl":"https://doi.org/10.1016/j.specom.2024.103070","url":null,"abstract":"<div><p>In this study, the challenges of adapting automatic speaker verification (ASV) systems to accommodate individuals with dysarthria, a speech disorder affecting intelligibility and articulation, are addressed. The scarcity of dysarthric speech data presents a significant obstacle in the development of an effective ASV system. To mitigate the detrimental effects of data paucity, an out-of-domain data augmentation approach was employed based on the observation that dysarthric speech often exhibits longer phoneme duration. Motivated by this observation, the duration of healthy speech data was modified with various stretching factors and then pooled into training, resulting in a significant reduction in the error rate. In addition to analyzing average phoneme duration, another analysis revealed that dysarthric speech contains crucial high-frequency spectral information. However, Mel-frequency cepstral coefficients (MFCC) are inherently designed to down-sample spectral information in the higher-frequency regions, and the same is true for Mel-filterbank features. To address this shortcoming, Linear-filterbank cepstral coefficients (LFCC) were used in combination with MFCC features. While MFCC effectively captures certain aspects of dysarthric speech, LFCC complements this by capturing high-frequency details essential for accurate dysarthric speaker verification. This proposed feature fusion effectively minimizes spectral information loss, further reducing error rates. To support the significance of combination of MFCC and LFCC features in an automatic speaker verification system for speakers with dysarthria, comprehensive experimentation was conducted. 
The fusion of MFCC and LFCC features was compared with several other front-end acoustic features, such as Mel-filterbank features, linear filterbank features, wavelet filterbank features, linear prediction cepstral coefficients (LPCC), frequency domain LPCC, and constant Q cepstral coefficients (CQCC). The approaches were evaluated using both <em>i</em>-vector and <em>x</em>-vector-based representation, comparing systems developed using MFCC and LFCC features individually and in combination. The experimental results presented in this paper demonstrate substantial improvements, with a 25.78% reduction in equal error rate (EER) for <em>i</em>-vector models and a 23.66% reduction in EER for <em>x</em>-vector models when compared to the baseline ASV system. Additionally, the effect of feature concatenation with variation in dysarthria severity levels (low, medium, and high) was studied, and the proposed approach was found to be highly effective in those cases as well.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"160 ","pages":"Article 103070"},"PeriodicalIF":3.2,"publicationDate":"2024-04-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140555266","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
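The two ingredients of the approach, duration-based augmentation and frame-level feature fusion, can be sketched as below. This is an illustration under stated assumptions, not the paper's pipeline: the stretch is naive linear interpolation rather than a proper time-scale modification algorithm, and the stretching factors are hypothetical.

```python
def stretch(signal, factor):
    """Naive time-stretch by linear interpolation: lengthens the signal by
    `factor`, mimicking the longer phoneme durations of dysarthric speech.
    (A production system would use a proper tempo-scaling method.)"""
    n_out = int(len(signal) * factor)
    out = []
    for i in range(n_out):
        pos = i / factor
        lo = int(pos)
        hi = min(lo + 1, len(signal) - 1)
        frac = pos - lo
        out.append(signal[lo] * (1 - frac) + signal[hi] * frac)
    return out

def augment(healthy_utterances, factors=(1.1, 1.3, 1.5)):
    """Pool duration-stretched copies of healthy speech into the training set."""
    pool = list(healthy_utterances)
    for utt in healthy_utterances:
        for f in factors:
            pool.append(stretch(utt, f))
    return pool

def fuse_features(mfcc_frames, lfcc_frames):
    """Frame-level fusion: concatenate the MFCC and LFCC vectors of each frame,
    so low-frequency (Mel) and high-frequency (linear) detail coexist."""
    assert len(mfcc_frames) == len(lfcc_frames)
    return [m + l for m, l in zip(mfcc_frames, lfcc_frames)]
```

The fused frames would then feed the usual i-vector or x-vector extractor unchanged; only the front-end dimensionality grows.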
{"title":"An ensemble technique to predict Parkinson's disease using machine learning algorithms","authors":"Nutan Singh, Priyanka Tripathi","doi":"10.1016/j.specom.2024.103067","DOIUrl":"https://doi.org/10.1016/j.specom.2024.103067","url":null,"abstract":"<div><p>Parkinson's Disease (PD) is a progressive neurodegenerative disorder affecting motor and non-motor symptoms. Its symptoms develop slowly, making early identification difficult. Machine learning has a significant potential to predict Parkinson's disease on features hidden in voice data. This work aimed to identify the most relevant features from a high-dimensional dataset, which helps accurately classify Parkinson's Disease with less computation time. Three individual datasets with various medical features based on voice have been analyzed in this work. An Ensemble Feature Selection Algorithm (EFSA) technique based on filter, wrapper, and embedding algorithms that pick highly relevant features for identifying Parkinson's Disease is proposed, and the same has been validated on three different datasets based on voice. These techniques can shorten training time to improve model accuracy and minimize overfitting. We utilized different ML models such as K-Nearest Neighbors (KNN), Random Forest, Decision Tree, Support Vector Machine (SVM), Bagging Classifier, Multi-Layer Perceptron (MLP) Classifier, and Gradient Boosting. Each of these models was fine-tuned to ensure optimal performance within our specific context. Moreover, in addition to these established classifiers, we proposed an ensemble classifier is found on a high optimal majority of the votes. Dataset-I achieves classification accuracy with 97.6 %, F<sub>1</sub>-score 97.9 %, precision with 98 % and recall with 98 %. Dataset-II achieves classification accuracy 90.2 %, F<sub>1</sub>-score 90.2 %, precision 90.2 %, and recall 90.5 %. Dataset-III achieves 83.3 % accuracy, F<sub>1</sub>-score 83.3 %, precision 83.5 % and recall 83.3 %. 
These results have been taken using 13 out of 23, 45 out of 754, and 17 out of 46 features from respective datasets. The proposed EFSA model has performed with higher accuracy and is more efficient than other models for each dataset.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"159 ","pages":"Article 103067"},"PeriodicalIF":3.2,"publicationDate":"2024-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140547363","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
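The two ensembling ideas in this abstract, voting across feature selectors and voting across classifiers, reduce to simple counting. The sketch below shows both mechanics in miniature; the feature names and the vote threshold are hypothetical, and a real EFSA would obtain the candidate sets from actual filter, wrapper, and embedded selectors.

```python
from collections import Counter

def ensemble_feature_votes(selected_sets, min_votes=2):
    """Keep features chosen by at least `min_votes` of the selectors
    (e.g. filter, wrapper, embedded) -- the counting step of an
    ensemble feature-selection scheme."""
    votes = Counter(f for s in selected_sets for f in set(s))
    return sorted(f for f, v in votes.items() if v >= min_votes)

def majority_vote(predictions):
    """Hard-voting ensemble: per sample, the most common class label
    across classifiers wins. `predictions` is one label list per model."""
    return [Counter(sample).most_common(1)[0][0] for sample in zip(*predictions)]
```

With, say, three voice-feature selectors nominating jitter, shimmer, and f0 measures, only features nominated by a majority survive, which is what trims 754 candidate features down to a few dozen.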
{"title":"A multimodal model for predicting feedback position and type during conversation","authors":"Auriane Boudin , Roxane Bertrand , Stéphane Rauzy , Magalie Ochs , Philippe Blache","doi":"10.1016/j.specom.2024.103066","DOIUrl":"https://doi.org/10.1016/j.specom.2024.103066","url":null,"abstract":"<div><p>This study investigates conversational feedback, that is, a listener's reaction in response to a speaker, a phenomenon which occurs in all natural interactions. Feedback depends on the main speaker's productions and in return supports the elaboration of the interaction. As a consequence, feedback production has a direct impact on the quality of the interaction.</p><p>This paper examines all types of feedback, from generic to specific feedback, the latter of which has received less attention in the literature. We also present a fine-grained labeling system introducing two sub-types of specific feedback: <em>positive/negative</em> and <em>given/new</em>. Following a literature review on linguistic and machine learning perspectives highlighting the main issues in feedback prediction, we present a model based on a set of multimodal features which predicts the possible position of feedback and its type. This computational model makes it possible to precisely identify the different features in the speaker's production (morpho-syntactic, prosodic and mimo-gestural) which play a role in triggering feedback from the listener; the model also evaluates their relative importance.</p><p>The main contribution of this study is twofold: we sought to improve 1/ the model's performance in comparison with other approaches relying on a small set of features, and 2/ the model's interpretability, in particular by investigating feature importance. 
By integrating all the different modalities as well as high-level features, our model is uniquely positioned to be applied to French corpora.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"159 ","pages":"Article 103066"},"PeriodicalIF":3.2,"publicationDate":"2024-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0167639324000384/pdfft?md5=d3bb6a1d05cfbf539d30e718f252c2d8&pid=1-s2.0-S0167639324000384-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140331131","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Speech intelligibility prediction using generalized ESTOI with fine-tuned parameters","authors":"Szymon Drgas","doi":"10.1016/j.specom.2024.103068","DOIUrl":"https://doi.org/10.1016/j.specom.2024.103068","url":null,"abstract":"<div><p>In this article, a lightweight and interpretable speech intelligibility prediction network is proposed. It is based on the ESTOI metric with several extensions: learned modulation filterbank, temporal attention, and taking into account robustness of a given reference recording. The proposed network is differentiable, and therefore it can be applied as a loss function in speech enhancement systems. The method was evaluated using the Clarity Prediction Challenge dataset. Compared to MB-STOI, the best of the systems proposed in this paper reduced RMSE from 28.01 to 21.33. It also outperformed best performing systems from the Clarity Challenge, while its training does not require additional labels like speech enhancement system and talker. It also has small memory and requirements, therefore, it can be potentially used as a loss function to train speech enhancement system. As it would consume less resources, the saved ones can be used for a larger speech enhancement neural network.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"159 ","pages":"Article 103068"},"PeriodicalIF":3.2,"publicationDate":"2024-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140540077","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}