{"title":"SLU for Voice Command in Smart Home: Comparison of Pipeline and End-to-End Approaches","authors":"Thierry Desot, François Portet, Michel Vacher","doi":"10.1109/ASRU46091.2019.9003891","DOIUrl":"https://doi.org/10.1109/ASRU46091.2019.9003891","url":null,"abstract":"Spoken Language Understanding (SLU) is typically performed through automatic speech recognition (ASR) and natural language understanding (NLU) in a pipeline. However, errors at the ASR stage have a negative impact on the NLU performance. Hence, there is a rising interest in End-to-End (E2E) SLU to jointly perform ASR and NLU. Although E2E models have shown superior performance to modular approaches in many NLP tasks, current SLU E2E models have still not definitely superseded pipeline approaches. In this paper, we present a comparison of the pipeline and E2E approaches for the task of voice command in smart homes. Since there are no large non-English domain-specific data sets available, although needed for an E2E model, we tackle the lack of such data by combining Natural Language Generation (NLG) and text-to-speech (TTS) to generate French training data. The trained models were evaluated on voice commands acquired in a real smart home with several speakers. Results show that the E2E approach can reach performances similar to a state-of-the art pipeline SLU despite a higher WER than the pipeline approach. Furthermore, the E2E model can benefit from artificially generated data to exhibit lower Concept Error Rates than the pipeline baseline for slot recognition.","PeriodicalId":150913,"journal":{"name":"2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"48 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132155924","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Monotonic Recurrent Neural Network Transducer and Decoding Strategies","authors":"Anshuman Tripathi, Han Lu, H. Sak, H. Soltau","doi":"10.1109/ASRU46091.2019.9003822","DOIUrl":"https://doi.org/10.1109/ASRU46091.2019.9003822","url":null,"abstract":"Recurrent Neural Network Transducer (RNNT) is an end-to-end model which transduces discrete input sequences to output sequences by learning alignments between the sequences. In speech recognition tasks we generally have a strictly monotonic alignment between time frames and label sequence. However, the standard RNNT loss does not enforce this constraint. This can cause some anomalies in alignments such as the model outputting a sequence of labels at a single time frame. There is also no bound on the decoding time steps. To address these problems, we introduce a monotonic version of the RNNT loss. Under the assumption that the output sequence is not longer than the input sequence, this loss can be used with forward-backward algorithm to learn strictly monotonic alignments between the sequences. We present experimental studies showing that speech recognition accuracy for monotonic RNNT is equivalent to standard RNNT. We also explore best-first and breadth-first decoding strategies for both monotonic and standard RNNT models. Our experiments show that breadth-first search is effective in exploring and combining alternative alignments. Additionally, it also allows batching of hypotheses during search label expansion, allowing better resource utilization, and resulting in decoding speedup.","PeriodicalId":150913,"journal":{"name":"2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"84 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132430047","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Using Very Deep Convolutional Neural Networks to Automatically Detect Plagiarized Spoken Responses","authors":"Xinhao Wang, Keelan Evanini, Yao Qian, K. Zechner","doi":"10.1109/ASRU46091.2019.9003924","DOIUrl":"https://doi.org/10.1109/ASRU46091.2019.9003924","url":null,"abstract":"This study focuses on the automatic plagiarism detection in the context of high-stakes spoken language proficiency assessment, in which some test takers may attempt to game the test by memorizing prepared source materials before the test and then adapting them on-the-fly during the test to produce their spoken responses. When trying to identify such instances of plagiarism, experienced human raters attempt to find salient matching expressions that appear both in potential source materials and the test responses. This motivates an approach that visualizes a grid of lexical matches between a test response and a source and then applies state-of-the-art image recognition techniques to detect patterns of matching sequences. This study employs Inception networks-very deep convolutional neural networks-to build automatic detection models. The system achieves an F1-score of 79.6% on the class of plagiarized responses outperforming a baseline system based on word sequence matching (F1-score of 74.1%).","PeriodicalId":150913,"journal":{"name":"2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125534135","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards Controlling False Alarm - Miss Trade-Off in Perceptual Speaker Comparison via Non-Neutral Listening Task Framing","authors":"Rosa González Hautamäki, T. Kinnunen","doi":"10.1109/ASRU46091.2019.9003978","DOIUrl":"https://doi.org/10.1109/ASRU46091.2019.9003978","url":null,"abstract":"Speaker comparison by listening is a valuable resource, for instance, in human voice discrimination studies, and voice conversion (VC) systems evaluations. Usually, listeners are provided with application-neutral guidelines that encourage retaining overall high speaker discrimination accuracy. Nonetheless, listeners are subject to misses (declaring same-speaker trial as different-speaker) and false alarms (vice versa) with possibly non-symmetric outcomes. In automatic speaker verification (ASV) applications, the consequences of a miss and a false alarm are rarely equal, and decision making policy is adjusted towards a given application with a desired miss/false alarm trade-off. We study whether listener decisions could similarly be controlled to provoke more accept (or reject) decisions, by framing the voice comparison task in different ways. Our neutral, forensic, user-convenient bank and secure bank scenarios are played by disjoint panels (through Amazon's Mechanical Turk), all judging the same speaker trials originated from RedDots and 2018 Voice Conversion Challenge (VCC 2018) data. Our results indicate that listener decisions can be influenced by modifying the task framing. As a subjective task, the challenge is how to drive the panel decisions to the desired direction (to reduce miss or false alarm rate). Our preliminary results suggest potential for novel, application-directed speaker discrimination designs.","PeriodicalId":150913,"journal":{"name":"2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"75 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115035978","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Spatio-Temporal Context Modelling for Speech Emotion Classification","authors":"Md. Asif Jalal, Roger K. Moore, Thomas Hain","doi":"10.1109/ASRU46091.2019.9004037","DOIUrl":"https://doi.org/10.1109/ASRU46091.2019.9004037","url":null,"abstract":"Speech emotion recognition (SER) is a requisite for emotional intelligence that affects the understanding of speech. One of the most crucial tasks is to obtain patterns having a maximum correlation for the emotion classification task from the speech signal while being invariant to the changes in frequency, time and other external distortions. Therefore, learning emotional contextual feature representation independent of speaker and environment is essential. In this paper, a novel spatiotemporal context modelling framework for robust SER is proposed to learn feature representation by using acoustic context expansion with high dimensional feature projection. The framework uses a deep convolutional neural network (CNN) and self-attention network. The CNNs combine spatiotemporal features. The attention network produces high dimensional task-specific features and combines these features for context modelling, which altogether provides a state-of-the-art technique for classifying the extracted patterns for speech emotion. Speech emotion is a categorical perception representing discrete sensory events. The proposed approach is compared with a wide range of architectures on the RAVDESS and IEMOCAP corpora for 8-class and 4-class emotion classification tasks and remarkable gain over state-of-the-art systems are obtained, absolutely 15%, 10% respectively.","PeriodicalId":150913,"journal":{"name":"2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"192 5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116783173","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Comparative Study on End-to-End Speech to Text Translation","authors":"Parnia Bahar, Tobias Bieschke, H. Ney","doi":"10.1109/ASRU46091.2019.9003774","DOIUrl":"https://doi.org/10.1109/ASRU46091.2019.9003774","url":null,"abstract":"Recent advances in deep learning show that end-to-end speech to text translation model is a promising approach to direct the speech translation field. In this work, we provide an overview of different end-to-end architectures, as well as the usage of an auxiliary connectionist temporal classification (CTC) loss for better convergence. We also investigate on pre-training variants such as initializing different components of a model using pretrained models, and their impact on the final performance, which gives boosts up to 4% in Bleu and 5% in Ter. Our experiments are performed on 270h IWSLT TED-talks En→De, and 100h LibriSpeech Audio-books En→Fr. We also show improvements over the current end-to-end state-of-the-art systems on both tasks.","PeriodicalId":150913,"journal":{"name":"2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"70 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-11-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116461651","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Recurrent Neural Network Transducer for Audio-Visual Speech Recognition","authors":"Takaki Makino, H. Liao, Yannis Assael, Brendan Shillingford, Basi García, Otavio Braga, O. Siohan","doi":"10.1109/ASRU46091.2019.9004036","DOIUrl":"https://doi.org/10.1109/ASRU46091.2019.9004036","url":null,"abstract":"This work presents a large-scale audio-visual speech recognition system based on a recurrent neural network transducer (RNN-T) architecture. To support the development of such a system, we built a large audio-visual (A/V) dataset of segmented utterances extracted from YouTube public videos, leading to 31k hours of audio-visual training content. The performance of an audio-only, visual-only, and audio-visual system are compared on two large-vocabulary test sets: a set of utterance segments from public YouTube videos called YTDEV18 and the publicly available LRS3-TED set. To highlight the contribution of the visual modality, we also evaluated the performance of our system on the YTDEV18 set artificially corrupted with background noise and overlapping speech. To the best of our knowledge, our system significantly improves the state-of-the-art on the LRS3-TED set.","PeriodicalId":150913,"journal":{"name":"2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"186 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-11-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123047643","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Comparison of End-to-End Models for Long-Form Speech Recognition","authors":"C. Chiu, Wei Han, Yu Zhang, Ruoming Pang, S. Kishchenko, Patrick Nguyen, A. Narayanan, H. Liao, Shuyuan Zhang, Anjuli Kannan, Rohit Prabhavalkar, Z. Chen, Tara N. Sainath, Yonghui Wu","doi":"10.1109/ASRU46091.2019.9003854","DOIUrl":"https://doi.org/10.1109/ASRU46091.2019.9003854","url":null,"abstract":"End-to-end automatic speech recognition (ASR) models, including both attention-based models and the recurrent neural network transducer (RNN-T), have shown superior performance compared to conventional systems [1], [2]. However, previous studies have focused primarily on short utterances that typically last for just a few seconds or, at most, a few tens of seconds. Whether such architectures are practical on long utterances that last from minutes to hours remains an open question. In this paper, we both investigate and improve the performance of end-to-end models on long-form transcription. We first present an empirical comparison of different end-to-end models on a real world long-form task and demonstrate that the RNN-T model is much more robust than attention-based systems in this regime. We next explore two improvements to attention-based systems that significantly improve its performance: restricting the attention to be monotonic, and applying a novel decoding algorithm that breaks long utterances into shorter overlapping segments. Combining these two improvements, we show that attention-based end-to-end models can be very competitive to RNN-T on long-form speech recognition.","PeriodicalId":150913,"journal":{"name":"2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"187 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114195088","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Low-Resource Domain Adaptation for Speaker Recognition Using Cycle-Gans","authors":"P. S. Nidadavolu, Saurabh Kataria, J. Villalba, N. Dehak","doi":"10.1109/ASRU46091.2019.9003748","DOIUrl":"https://doi.org/10.1109/ASRU46091.2019.9003748","url":null,"abstract":"Current speaker recognition technology provides great performance with the x-vector approach. However, performance decreases when the evaluation domain is different from the training domain, an issue usually addressed with domain adaptation approaches. Recently, unsupervised domain adaptation using cycle-consistent Generative Adversarial Networks (CycleGAN) has received a lot of attention. Cycle-GAN learn mappings between features of two domains given non-parallel data. We investigate their effectiveness in low resource scenario i.e. when limited amount of target domain data is available for adaptation, a case unexplored in previous works. We experiment with two adaptation tasks: microphone to telephone and a novel reverberant to clean adaptation with the end goal of improving speaker recognition performance. Number of speakers present in source and target domains are 7000 and 191 respectively. By adding noise to the target domain during CycleGAN training, we were able to achieve better performance compared to the adaptation system whose CycleGAN was trained on a larger target data. On reverberant to clean adaptation task, our models improved EER by 18.3% relative on VOiCES dataset compared to a system trained on clean data. They also slightly improved over the state-of-the-art Weighted Prediction Error (WPE) de-reverberation algorithm.","PeriodicalId":150913,"journal":{"name":"2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131165040","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Recognizing Long-Form Speech Using Streaming End-to-End Models","authors":"A. Narayanan, Rohit Prabhavalkar, C. Chiu, David Rybach, Tara N. Sainath, Trevor Strohman","doi":"10.1109/ASRU46091.2019.9003913","DOIUrl":"https://doi.org/10.1109/ASRU46091.2019.9003913","url":null,"abstract":"All-neural end-to-end (E2E) automatic speech recognition (ASR) systems that use a single neural network to transduce audio to word sequences have been shown to achieve state-of-the-art results on several tasks. In this work, we examine the ability of E2E models to generalize to unseen domains, where we find that models trained on short utterances fail to generalize to long-form speech. We propose two complementary solutions to address this: training on diverse acoustic data, and LSTM state manipulation to simulate long-form audio when training using short utterances. On a synthesized long-form test set, adding data diversity improves word error rate (WER) by 90% relative, while simulating long-form training improves it by 67% relative, though the combination doesn't improve over data diversity alone. On a real long-form call-center test set, adding data diversity improves WER by 40% relative. Simulating long-form training on top of data diversity improves performance by an additional 27% relative.","PeriodicalId":150913,"journal":{"name":"2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125361187","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}