{"title":"Controllable Emphatic Speech Synthesis based on Forward Attention for Expressive Speech Synthesis","authors":"Liangqi Liu, Jiankun Hu, Zhiyong Wu, Song Yang, Songfan Yang, Jia Jia, H. Meng","doi":"10.1109/SLT48900.2021.9383537","DOIUrl":"https://doi.org/10.1109/SLT48900.2021.9383537","url":null,"abstract":"In speech interaction scenarios, speech emphasis is essential for expressing the underlying intention and attitude. Recently, end-to-end emphatic speech synthesis greatly improves the naturalness of synthetic speech, but also brings new problems: 1) lack of interpretability for how emphatic codes affect the model; 2) no separate control of emphasis on duration and on intonation and energy. We propose a novel way to build an interpretable and controllable emphatic speech synthesis framework based on forward attention. Firstly, we explicitly model the local variation of speaking rate for emphasized words and neutral words with modified forward attention to manifest emphasized words in terms of duration. The 2-layers LSTM in decoder is further divided into attention-RNN and decoder-RNN to disentangle the influence of emphasis on duration and on intonation and energy. The emphasis information is injected into decoder-RNN for highlighting emphasized words in the aspects of intonation and energy. Experimental results have shown that our model can not only provide separate control of emphasis on duration and on intonation and energy, but also generate more robust and prominent emphatic speech with high quality and naturalness.","PeriodicalId":243211,"journal":{"name":"2021 IEEE Spoken Language Technology Workshop (SLT)","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-01-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129034703","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dynamically Weighted Ensemble Models for Automatic Speech Recognition","authors":"K. Praveen, Abhishek Pandey, D. Kumar, S. Rath, Sandip Shriram Bapat","doi":"10.1109/SLT48900.2021.9383463","DOIUrl":"https://doi.org/10.1109/SLT48900.2021.9383463","url":null,"abstract":"In machine learning, training multiple models for the same task, and using the outputs from all the models helps reduce the variance of the combined result. Using an ensemble of models in classification tasks such as Automatic Speech Recognition (ASR) improves the accuracy across different target domains such as multiple accents, environmental conditions, and other scenarios. It is possible to select model weights for the ensemble in numerous ways. A classifier trained to identify target domain, a simple averaging function, or an exhaustive grid search are the common approaches to obtain suitable weights. All these methods suffer either in choosing sub-optimal weights or by being computationally expensive. We propose a novel and practical method for dynamic weight selection in an ensemble, which can approximate a grid search in a time-efficient manner. We show that a combination of weights always performs better than assigning uniform weights for all models. Our algorithm can utilize a validation set if available or find weights dynamically from the input utterance itself. Experiments conducted for various ASR tasks show that the proposed method outperforms the uniformly weighted ensemble in terms of Word Error Rate (WER) in our experiments.","PeriodicalId":243211,"journal":{"name":"2021 IEEE Spoken Language Technology Workshop (SLT)","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-01-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133612322","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Lightspeech: Lightweight Non-Autoregressive Multi-Speaker Text-To-Speech","authors":"Song Li, Beibei Ouyang, Lin Li, Q. Hong","doi":"10.1109/SLT48900.2021.9383562","DOIUrl":"https://doi.org/10.1109/SLT48900.2021.9383562","url":null,"abstract":"With the development of deep learning, end-to-end neural text-to-speech systems have achieved significant improvements on high-quality speech synthesis. However, most of these systems are attention-based autoregressive models, resulting in slow synthesis speed and large model parameters. In this paper, we propose a new lightweight non-autoregressive multi-speaker speech synthesis system, named LightSpeech, which utilizes the lightweight feedforward neural networks to accelerate synthesis and reduce the amount of parameters. With the speaker embedding, LightSpeech achieves multi-speaker speech synthesis extremely quickly. Experiments on the LibriTTS dataset show that, compared with FastSpeech, our smallest LightSpeech model achieves a 9.27x Mel-spectrogram generation acceleration on CPU, and the model size and parameters are compressed by 37.06x and 37.36x, respectively.","PeriodicalId":243211,"journal":{"name":"2021 IEEE Spoken Language Technology Workshop (SLT)","volume":"57 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-01-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131643014","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards Unsupervised Learning of Speech Features in the Wild","authors":"M. Rivière, Emmanuel Dupoux","doi":"10.1109/SLT48900.2021.9383461","DOIUrl":"https://doi.org/10.1109/SLT48900.2021.9383461","url":null,"abstract":"Recent work on unsupervised contrastive learning of speech representation has shown promising results, but so far has mostly been applied to clean, curated speech datasets. Can it also be used with unprepared audio data \"in the wild\"? Here, we explore three potential problems in this setting: (i) presence of non-speech data, (ii) noisy or low quality speech data, and (iii) imbalance in speaker distribution. We show that on the Libri-light train set, which is itself a relatively clean speech-only dataset, these problems combined can already have a performance cost of up to 30% relative for the ABX score. We show that the first two problems can be alleviated by data filtering, with voice activity detection selecting speech segments, while perplexity of a model trained with clean data helping to discard entire files. We show that the third problem can be alleviated by learning a speaker embedding in the predictive branch of the model. We show that these techniques build more robust speech features that can be transferred to an ASR task in the low resource setting.","PeriodicalId":243211,"journal":{"name":"2021 IEEE Spoken Language Technology Workshop (SLT)","volume":"307 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-01-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116274791","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Data-Filtering Methods for Self-Training of Automatic Speech Recognition Systems","authors":"Alexandru-Lucian Georgescu, Cristian Manolache, Dan Oneaţă, H. Cucu, C. Burileanu","doi":"10.1109/SLT48900.2021.9383577","DOIUrl":"https://doi.org/10.1109/SLT48900.2021.9383577","url":null,"abstract":"Self-training is a simple and efficient way of leveraging un-labeled speech data: (i) start with a seed system trained on transcribed speech; (ii) pass the unlabeled data through this seed system to automatically generate transcriptions; (iii) en-large the initial dataset with the self-labeled data and retrain the speech recognition system. However, in order not to pol-lute the augmented dataset with incorrect transcriptions, an important intermediary step is to select those parts of the self-labeled data that are accurate. Several approaches have been proposed in the community, but most of the works address only a single method. In contrast, in this paper we inspect three distinct classes of data-filtering for self-training, leveraging: (i) confidence scores, (ii) multiple ASR hypotheses and (iii) approximate transcriptions. We evaluate these approaches from two perspectives: quantity vs. quality of the selected data and improvement of the seed ASR by including this data. The proposed methodology achieves state-of-the-art results on Romanian speech, obtaining 25% relative improvement over prior work. Among the three methods, approximate transcriptions bring the highest performance gain, even if they yield the least quantity of data.","PeriodicalId":243211,"journal":{"name":"2021 IEEE Spoken Language Technology Workshop (SLT)","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-01-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122806318","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dual-Path RNN for Long Recording Speech Separation","authors":"Chenda Li, Yi Luo, Cong Han, Jinyu Li, Takuya Yoshioka, Tianyan Zhou, Marc Delcroix, K. Kinoshita, Christoph Böddeker, Y. Qian, Shinji Watanabe, Zhuo Chen","doi":"10.1109/SLT48900.2021.9383514","DOIUrl":"https://doi.org/10.1109/SLT48900.2021.9383514","url":null,"abstract":"Continuous speech separation (CSS) is an arising task in speech separation aiming at separating overlap-free targets from a long, partially-overlapped recording. A straightforward extension of previously proposed sentence-level separation models to this task is to segment the long recording into fixed-length blocks and perform separation on them independently. However, such simple extension does not fully address the cross-block dependencies and the separation performance may not be satisfactory. In this paper, we focus on how the block-level separation performance can be improved by exploring methods to utilize the cross-block information. Based on the recently proposed dual-path RNN (DPRNN) architecture, we investigate how DPRNN can help the block-level separation by the interleaved intra- and inter-block modules. Experiment results show that DPRNN is able to significantly outperform the baseline block-level model in both offline and block-online configurations under certain settings.","PeriodicalId":243211,"journal":{"name":"2021 IEEE Spoken Language Technology Workshop (SLT)","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-01-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131680420","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"End-to-End Speaker Diarization Conditioned on Speech Activity and Overlap Detection","authors":"Yuki Takashima, Yusuke Fujita, Shinji Watanabe, Shota Horiguchi, Leibny Paola García-Perera, Kenji Nagamatsu","doi":"10.1109/SLT48900.2021.9383555","DOIUrl":"https://doi.org/10.1109/SLT48900.2021.9383555","url":null,"abstract":"In this paper, we present a conditional multitask learning method for end-to-end neural speaker diarization (EEND). The EEND system has shown promising performance compared with traditional clustering-based methods, especially in the case of overlapping speech. In this paper, to further improve the performance of the EEND system, we propose a novel multitask learning framework that solves speaker diarization and a desired subtask while explicitly considering the task dependency. We optimize speaker diarization conditioned on speech activity and overlap detection that are subtasks of speaker diarization, based on the probabilistic chain rule. Experimental results show that our proposed method can leverage a subtask to effectively model speaker diarization, and outperforms conventional EEND systems in terms of diarization error rate.","PeriodicalId":243211,"journal":{"name":"2021 IEEE Spoken Language Technology Workshop (SLT)","volume":"91 5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-01-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131208016","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Domain Generalization with Triplet Network for Cross-Corpus Speech Emotion Recognition","authors":"Shi-wook Lee","doi":"10.1109/SLT48900.2021.9383534","DOIUrl":"https://doi.org/10.1109/SLT48900.2021.9383534","url":null,"abstract":"Domain generalization is a major challenge for cross-corpus speech emotion recognition. The recognition performance built on \"seen\" source corpora is inevitably degraded when the systems are tested against \"unseen\" target corpora that have different speakers, channels, and languages. We present a novel framework based on a triplet network to learn more generalized features of emotional speech that are invariant across multiple corpora. To reduce the intrinsic discrepancies between source and target corpora, an explicit feature transformation based on the triplet network is implemented as a preprocessing step. Extensive comparison experiments are carried out on three emotional speech corpora; two English corpora, and one Japanese corpus. Remarkable improvements of up-to 35.61% are achieved for all cross-corpus speech emotion recognition, and we show that the proposed framework using the triplet network is effective for obtaining more generalized features across multiple emotional speech corpora.","PeriodicalId":243211,"journal":{"name":"2021 IEEE Spoken Language Technology Workshop (SLT)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-01-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130745773","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An Investigation into the Multi-channel Time Domain Speaker Extraction Network","authors":"Catalin Zorila, Mohan Li, R. Doddipatla","doi":"10.1109/SLT48900.2021.9383582","DOIUrl":"https://doi.org/10.1109/SLT48900.2021.9383582","url":null,"abstract":"This paper presents an investigation into the effectiveness of spatial features for improving time-domain speaker extraction systems. A two-dimensional Convolutional Neural Network (CNN) based encoder is proposed to capture the spatial information within the multichannel input, which are then combined with the spectral features of a single channel extraction network. Two variants of target speaker extraction methods were tested, one which employs a pre-trained i-vector system to compute a speaker embedding (System A), and one which employs a jointly trained neural network to extract the embeddings directly from time domain enrolment signals (System B). The evaluation was performed on the spatialized WSJ0-2mix dataset using the Signal-to-Distortion Ratio (SDR) metric, and ASR accuracy. In the anechoic condition, more than 10 dB and 7 dB absolute SDR gains were achieved when the 2-D CNN spatial encoder features were included with Systems A and B, respectively. The performance gains in reverberation were lower, however, we have demonstrated that retraining the systems by applying dereverberation preprocessing can significantly boost both the target speaker extraction and ASR performances.","PeriodicalId":243211,"journal":{"name":"2021 IEEE Spoken Language Technology Workshop (SLT)","volume":"56 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-01-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130847315","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}