{"title":"Training Language Models for Long-Span Cross-Sentence Evaluation","authors":"Kazuki Irie, Albert Zeyer, R. Schlüter, H. Ney","doi":"10.1109/ASRU46091.2019.9003788","DOIUrl":"https://doi.org/10.1109/ASRU46091.2019.9003788","url":null,"abstract":"While recurrent neural networks can motivate cross-sentence language modeling and its application to automatic speech recognition (ASR), the corresponding modifications of the training method to that end are rarely discussed. In fact, even more generally, the impact of the training sequence construction strategy in language modeling for different evaluation conditions is typically ignored. In this work, we revisit this basic but fundamental question. We train language models based on long short-term memory recurrent neural networks and Transformers using various types of training sequences and study their robustness with respect to different evaluation modes. Our experiments on the 300h Switchboard and Quaero English datasets show that models trained with back-propagation over sequences consisting of concatenations of multiple sentences, with state carry-over across sequences, effectively outperform those trained with sentence-level training, both in terms of perplexity and word error rates for cross-utterance ASR.","PeriodicalId":150913,"journal":{"name":"2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131524976","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Novel Enhanced Teager Energy Based Cepstral Coefficients for Replay Spoof Detection","authors":"R. Acharya, H. Patil, Harsh Kotta","doi":"10.1109/ASRU46091.2019.9003934","DOIUrl":"https://doi.org/10.1109/ASRU46091.2019.9003934","url":null,"abstract":"A replay attack on voice biometrics refers to the fraudulent attempt made by an imposter to spoof another person's identity by replaying pre-recorded voice samples in front of an Automatic Speaker Verification (ASV) system. In an attempt to develop countermeasures against replay attacks, this paper proposes a new feature set, namely, Enhanced Teager Energy Cepstral Coefficients (ETECC), using the recently introduced concept of signal mass. Results obtained on the ASVspoof 2017 version 2.0 dataset suggest that the proposed feature set performs better than the original Teager Energy Cepstral Coefficients (TECC) feature set because the Enhanced Teager Energy Operator (ETEO) gives a better estimate of a signal's energy than the Teager Energy Operator (TEO). We obtained 53.3% and 51.35% reductions in EER on the development and evaluation sets, respectively, with respect to the baseline system.","PeriodicalId":150913,"journal":{"name":"2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"55 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133339630","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards Real-Time Mispronunciation Detection in Kids' Speech","authors":"Peter William VanHarn Plantinga, E. Fosler-Lussier","doi":"10.1109/ASRU46091.2019.9003863","DOIUrl":"https://doi.org/10.1109/ASRU46091.2019.9003863","url":null,"abstract":"Modern mispronunciation detection and diagnosis systems have seen significant gains in accuracy due to the introduction of deep learning. However, these systems have not been evaluated for their ability to run in real time, an important factor in applications that provide rapid feedback. In particular, the state of the art uses bi-directional recurrent networks, where a uni-directional network may be more appropriate. Teacher-student learning is a natural approach for improving a uni-directional model, but when using a CTC objective, it is limited by poor alignment of outputs to evidence. We address this limitation with two loss terms that improve the alignments of our models. One is an “alignment loss” term that encourages outputs only when features do not resemble silence. The other uses a uni-directional model as a teacher to align the bi-directional model; our proposed student model then learns from these aligned bi-directional teacher models. Experiments on the CSLU kids' corpus show that these changes decrease the latency of the outputs and improve the detection rates, with a trade-off between these two goals.","PeriodicalId":150913,"journal":{"name":"2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"56 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130750236","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Domain Adaptation via Teacher-Student Learning for End-to-End Speech Recognition","authors":"Zhong Meng, Jinyu Li, Yashesh Gaur, Y. Gong","doi":"10.1109/ASRU46091.2019.9003776","DOIUrl":"https://doi.org/10.1109/ASRU46091.2019.9003776","url":null,"abstract":"Teacher-student (T/S) learning has been shown to be effective for domain adaptation of deep neural network acoustic models in hybrid speech recognition systems. In this work, we extend T/S learning to large-scale unsupervised domain adaptation of an attention-based end-to-end (E2E) model through two levels of knowledge transfer: the teacher's token posteriors as soft labels and its one-best predictions as decoder guidance. To further improve T/S learning with the help of ground-truth labels, we propose adaptive T/S (AT/S) learning. Instead of conditionally choosing from either the teacher's soft token posteriors or the one-hot ground-truth label, in AT/S the student always learns from both the teacher and the ground truth, with a pair of adaptive weights assigned to the soft and one-hot labels quantifying the confidence in each knowledge source. The confidence scores are dynamically estimated at each decoder step as a function of the soft and one-hot labels. With 3,400 hours of parallel close-talk and far-field Microsoft Cortana data for domain adaptation, T/S and AT/S achieve 6.3% and 10.3% relative word error rate improvements over a strong E2E model trained with the same amount of far-field data.","PeriodicalId":150913,"journal":{"name":"2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132311993","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Learning Between Different Teacher and Student Models in ASR","authors":"J. H. M. Wong, M. Gales, Yu Wang","doi":"10.1109/ASRU46091.2019.9003756","DOIUrl":"https://doi.org/10.1109/ASRU46091.2019.9003756","url":null,"abstract":"Teacher-student learning can be applied in automatic speech recognition for model compression and domain adaptation. This trains a student model to emulate the behaviour of a teacher model, and only the student is used to perform recognition. Depending on the application, the teacher and student may differ in their model types, complexities, input contexts, and input features. In previous works, it is often shown that learning from a strong teacher allows the student to perform better than an equivalent model trained with only the reference transcriptions. However, there has not been much investigation into whether a particular form of teacher is appropriate for the student to learn from. This paper aims to study how effectively the student is able to learn from the teacher when differences exist between their designs. The Augmented Multi-party Interaction (AMI) meeting transcription and Multi-Genre Broadcast (MGB-3) television broadcast audio tasks are used in this analysis. Experimental results suggest that a student can effectively learn from a more complex teacher, but may struggle when it lacks input information. It is therefore important to carefully consider the design of the student for each application.","PeriodicalId":150913,"journal":{"name":"2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120965225","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Markov Recurrent Neural Network Language Model","authors":"Jen-Tzung Chien, Che-Yu Kuo","doi":"10.1109/ASRU46091.2019.9003850","DOIUrl":"https://doi.org/10.1109/ASRU46091.2019.9003850","url":null,"abstract":"The recurrent neural network (RNN) has achieved great success in language modeling, where temporal information based on a deterministic state is continuously extracted and evolved through time. Such a simple deterministic transition function using input-to-hidden and hidden-to-hidden weights is usually insufficient to reflect the diversities and variations of the latent variable structure behind heterogeneous natural language. This paper presents a new stochastic Markov RNN (MRNN) to strengthen the learning capability of the language model, where the trajectory of word sequences is driven by a neural Markov process with Markov state transitions based on a K-state long short-term memory model. A latent state machine is constructed to characterize the complicated semantics in structured lexical patterns. Gumbel-softmax is introduced to implement the stochastic backpropagation algorithm with discrete states. Parallel computation for rapid realization of the MRNN is presented, and the variational Bayesian learning procedure is implemented. Experiments demonstrate the merits of the stochastic and diverse representation of the MRNN language model, where the overhead in parameters and computation is limited.","PeriodicalId":150913,"journal":{"name":"2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123583424","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Optimizing Neural Network Embeddings Using a Pair-Wise Loss for Text-Independent Speaker Verification","authors":"Hira Dhamyal, Tianyan Zhou, B. Raj, Rita Singh","doi":"10.1109/ASRU46091.2019.9003794","DOIUrl":"https://doi.org/10.1109/ASRU46091.2019.9003794","url":null,"abstract":"This paper proposes a new loss function called the “quartet” loss for better optimization of neural networks for matching tasks. For such tasks, where neural network embeddings are the key component, optimizing the network for better embeddings is critical. The embeddings are required to be class discriminative, with minimal intra-class variation and maximal inter-class variation even for unseen classes, for better generalization of the network. The quartet loss explicitly computes the distance metric between pairs of inputs and increases the gap between the similarity score distributions of same-class pairs and different-class pairs. We evaluate on the speaker verification task and demonstrate the performance of the loss with our proposed neural network.","PeriodicalId":150913,"journal":{"name":"2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125026657","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Spoken Multiple-Choice Question Answering Using Multimodal Convolutional Neural Networks","authors":"Shang-Bao Luo, Hung-Shin Lee, Kuan-Yu Chen, H. Wang","doi":"10.1109/ASRU46091.2019.9003966","DOIUrl":"https://doi.org/10.1109/ASRU46091.2019.9003966","url":null,"abstract":"In a spoken multiple-choice question answering (MCQA) task, where passages, questions, and choices are given in the form of speech, usually only the auto-transcribed text is considered in system development. However, the acoustic-level information may contain useful cues for answer prediction, and to the best of our knowledge, only a few studies focus on using acoustic-level information, or fusing it with text-level information, for a spoken MCQA task. Therefore, this paper presents a hierarchical multistage multimodal (HMM) framework based on convolutional neural networks (CNNs) to integrate text- and acoustic-level statistics into neural modeling for spoken MCQA. Specifically, the acoustic-level statistics are expected to offset text inaccuracies caused by automatic speech recognition (ASR) systems or representation inadequacies lurking in word embedding generators, thereby making the spoken MCQA system robust. In the proposed HMM framework, the two modalities are first manipulated to separately derive acoustic- and text-level representations for the passage, question, and choices. Next, these features are jointly involved in inferring the relationships among the passage, question, and choices. Then, a final representation is derived for each choice, which encodes the relationship of the choice to the passage and question. Finally, the most likely answer is determined based on the individual final representations of all choices. Evaluated on the data of “Formosa Grand Challenge - Talk to AI”, a Mandarin Chinese spoken MCQA contest held in 2018, the proposed HMM framework achieves remarkable improvements in accuracy over the text-only baseline.","PeriodicalId":150913,"journal":{"name":"2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"77 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114764956","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Dropout-Based Single Model Committee Approach for Active Learning in ASR","authors":"Jiayi Fu, Kuang Ru","doi":"10.1109/ASRU46091.2019.9003728","DOIUrl":"https://doi.org/10.1109/ASRU46091.2019.9003728","url":null,"abstract":"In this paper, we propose a new committee-based approach for active learning (AL) in automatic speech recognition (ASR). This approach achieves a lower recognition word error rate (WER) with fewer transcriptions by selecting the most informative samples. Different from previous committee-based AL approaches, the committee construction process of this approach needs to train only one acoustic model (AM), with dropout. Since only one model needs to be trained, this approach is simpler and faster. Moreover, as the AM is improved continuously, we found this approach to be more robust to such improvement. In experiments, we compared our approach with random sampling and with another state-of-the-art committee-based approach based on heterogeneous neural networks (HNN). We examined our approach in terms of WER, the time to construct the committee, and robustness to model improvement on a Mandarin ASR task with 1600 hours of speech data. The results show that it achieves 2–3 times the relative WER reduction of random sampling, and that it uses only 75% of the time to achieve a WER close to that of the HNN-based approach.","PeriodicalId":150913,"journal":{"name":"2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125717107","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Transfer Learning for Context-Aware Spoken Language Understanding","authors":"Qian Chen, Zhu Zhuo, Wen Wang, Qiuyun Xu","doi":"10.1109/ASRU46091.2019.9003902","DOIUrl":"https://doi.org/10.1109/ASRU46091.2019.9003902","url":null,"abstract":"Spoken language understanding (SLU) is a key component of task-oriented dialogue systems. SLU parses natural language user utterances into semantic frames. Previous work has shown that incorporating context information significantly improves SLU performance for multi-turn dialogues. However, collecting a large-scale human-labeled multi-turn dialogue corpus for the target domains is complex and costly. To reduce dependency on the collection and annotation effort, we propose a Context Encoding Language Transformer (CELT) model that facilitates exploiting various types of context information for SLU. We explore different transfer learning approaches to reduce dependency on data collection and annotation. In addition to unsupervised pre-training on large-scale general-purpose unlabeled corpora, such as Wikipedia, we explore unsupervised and supervised adaptive training approaches for transfer learning that benefit from other in-domain and out-of-domain dialogue corpora. Experimental results demonstrate that the proposed model with the proposed transfer learning approaches achieves significant improvement in SLU performance over state-of-the-art models on two large-scale single-turn dialogue benchmarks and one large-scale multi-turn dialogue benchmark.","PeriodicalId":150913,"journal":{"name":"2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126833518","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}