{"title":"Training Language Models for Long-Span Cross-Sentence Evaluation","authors":"Kazuki Irie, Albert Zeyer, R. Schlüter, H. Ney","doi":"10.1109/ASRU46091.2019.9003788","DOIUrl":"https://doi.org/10.1109/ASRU46091.2019.9003788","url":null,"abstract":"While recurrent neural networks can motivate cross-sentence language modeling and its application to automatic speech recognition (ASR), the corresponding modifications of the training method to that end are rarely discussed. In fact, even more generally, the impact of the training sequence construction strategy in language modeling for different evaluation conditions is typically ignored. In this work, we revisit this basic but fundamental question. We train language models based on long short-term memory recurrent neural networks and Transformers using various types of training sequences and study their robustness with respect to different evaluation modes. Our experiments on the 300h Switchboard and Quaero English datasets show that models trained with back-propagation over sequences consisting of concatenations of multiple sentences, with state carry-over across sequences, effectively outperform those trained with sentence-level training, both in terms of perplexity and word error rates for cross-utterance ASR.","PeriodicalId":150913,"journal":{"name":"2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131524976","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Novel Enhanced Teager Energy Based Cepstral Coefficients for Replay Spoof Detection","authors":"R. Acharya, H. Patil, Harsh Kotta","doi":"10.1109/ASRU46091.2019.9003934","DOIUrl":"https://doi.org/10.1109/ASRU46091.2019.9003934","url":null,"abstract":"A replay attack on voice biometrics refers to the fraudulent attempt made by an imposter to spoof another person's identity by replaying pre-recorded voice samples in front of an Automatic Speaker Verification (ASV) system. In an attempt to develop countermeasures against replay attacks, this paper proposes a new feature set, namely, Enhanced Teager Energy Cepstral Coefficients (ETECC), using the recently introduced concept of signal mass. Results obtained on the ASVspoof 2017 version 2.0 dataset suggest that the proposed feature set performs better than the original Teager Energy Cepstral Coefficients (TECC) feature set because the Enhanced Teager Energy Operator (ETEO) gives a better estimate of a signal's energy than the Teager Energy Operator (TEO). We obtained 53.3% and 51.35% reductions in EER on the development and evaluation sets, respectively, with respect to the baseline system.","PeriodicalId":150913,"journal":{"name":"2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"55 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133339630","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards Real-Time Mispronunciation Detection in Kids' Speech","authors":"Peter William VanHarn Plantinga, E. Fosler-Lussier","doi":"10.1109/ASRU46091.2019.9003863","DOIUrl":"https://doi.org/10.1109/ASRU46091.2019.9003863","url":null,"abstract":"Modern mispronunciation detection and diagnosis systems have seen significant gains in accuracy due to the introduction of deep learning. However, these systems have not been evaluated for their ability to run in real time, an important factor in applications that provide rapid feedback. In particular, the state of the art uses bi-directional recurrent networks, where a uni-directional network may be more appropriate. Teacher-student learning is a natural approach for improving a uni-directional model, but when using a CTC objective, it is limited by poor alignment of outputs to evidence. We address this limitation with two loss terms that improve the alignments of our models. One is an “alignment loss” term that encourages outputs only when features do not resemble silence. The other uses a uni-directional model as a teacher to align the bi-directional model; our proposed student model then learns from these aligned bi-directional teacher models. Experiments on the CSLU kids' corpus show that these changes decrease the latency of the outputs and improve the detection rates, with a trade-off between these two goals.","PeriodicalId":150913,"journal":{"name":"2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"56 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130750236","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Domain Adaptation via Teacher-Student Learning for End-to-End Speech Recognition","authors":"Zhong Meng, Jinyu Li, Yashesh Gaur, Y. Gong","doi":"10.1109/ASRU46091.2019.9003776","DOIUrl":"https://doi.org/10.1109/ASRU46091.2019.9003776","url":null,"abstract":"Teacher-student (T/S) learning has been shown to be effective for domain adaptation of deep neural network acoustic models in hybrid speech recognition systems. In this work, we extend T/S learning to large-scale unsupervised domain adaptation of an attention-based end-to-end (E2E) model through two levels of knowledge transfer: the teacher's token posteriors as soft labels and its one-best predictions as decoder guidance. To further improve T/S learning with the help of ground-truth labels, we propose adaptive T/S (AT/S) learning. Instead of conditionally choosing from either the teacher's soft token posteriors or the one-hot ground-truth label, in AT/S the student always learns from both the teacher and the ground truth, with a pair of adaptive weights assigned to the soft and one-hot labels quantifying the confidence in each knowledge source. The confidence scores are dynamically estimated at each decoder step as a function of the soft and one-hot labels. With 3,400 hours of parallel close-talk and far-field Microsoft Cortana data for domain adaptation, T/S and AT/S achieve 6.3% and 10.3% relative word error rate improvements over a strong E2E model trained with the same amount of far-field data.","PeriodicalId":150913,"journal":{"name":"2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132311993","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Learning Between Different Teacher and Student Models in ASR","authors":"J. H. M. Wong, M. Gales, Yu Wang","doi":"10.1109/ASRU46091.2019.9003756","DOIUrl":"https://doi.org/10.1109/ASRU46091.2019.9003756","url":null,"abstract":"Teacher-student learning can be applied in automatic speech recognition for model compression and domain adaptation. This trains a student model to emulate the behaviour of a teacher model, and only the student is used to perform recognition. Depending on the application, the teacher and student may differ in their model types, complexities, input contexts, and input features. In previous works, it is often shown that learning from a strong teacher allows the student to perform better than an equivalent model trained with only the reference transcriptions. However, there has not been much investigation into whether a particular form of teacher is appropriate for the student to learn from. This paper aims to study how effectively the student is able to learn from the teacher when differences exist between their designs. The Augmented Multi-party Interaction (AMI) meeting transcription and Multi-Genre Broadcast (MGB-3) television broadcast audio tasks are used in this analysis. Experimental results suggest that a student can effectively learn from a more complex teacher, but may struggle when it lacks input information. It is therefore important to carefully consider the design of the student for each application.","PeriodicalId":150913,"journal":{"name":"2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120965225","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Markov Recurrent Neural Network Language Model","authors":"Jen-Tzung Chien, Che-Yu Kuo","doi":"10.1109/ASRU46091.2019.9003850","DOIUrl":"https://doi.org/10.1109/ASRU46091.2019.9003850","url":null,"abstract":"The recurrent neural network (RNN) has achieved great success in language modeling, where temporal information based on a deterministic state is continuously extracted and evolved through time. Such a simple deterministic transition function using input-to-hidden and hidden-to-hidden weights is usually insufficient to reflect the diversities and variations of the latent variable structure behind heterogeneous natural language. This paper presents a new stochastic Markov RNN (MRNN) to strengthen the learning capability of the language model, where the trajectory of word sequences is driven by a neural Markov process with Markov state transitions based on a K-state long short-term memory model. A latent state machine is constructed to characterize the complicated semantics in structured lexical patterns. Gumbel-softmax is introduced to implement the stochastic backpropagation algorithm with discrete states. Parallel computation for rapid realization of the MRNN is presented, and the variational Bayesian learning procedure is implemented. Experiments demonstrate the merits of the stochastic and diverse representation of the MRNN language model, where the overhead in parameters and computation is limited.","PeriodicalId":150913,"journal":{"name":"2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123583424","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Optimizing Neural Network Embeddings Using a Pair-Wise Loss for Text-Independent Speaker Verification","authors":"Hira Dhamyal, Tianyan Zhou, B. Raj, Rita Singh","doi":"10.1109/ASRU46091.2019.9003794","DOIUrl":"https://doi.org/10.1109/ASRU46091.2019.9003794","url":null,"abstract":"This paper proposes a new loss function called the “quartet” loss for better optimization of neural networks for matching tasks. For such tasks, where neural network embeddings are the key component, optimizing the network for better embeddings is critical. The embeddings are required to be class discriminative, with minimal intra-class variation and maximal inter-class variation even for unseen classes, for better generalization of the network. The quartet loss explicitly computes the distance metric between pairs of inputs and increases the gap between the similarity score distributions of same-class pairs and different-class pairs. We evaluate on the speaker verification task and demonstrate the performance of the loss with our proposed neural network.","PeriodicalId":150913,"journal":{"name":"2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125026657","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Spoken Multiple-Choice Question Answering Using Multimodal Convolutional Neural Networks","authors":"Shang-Bao Luo, Hung-Shin Lee, Kuan-Yu Chen, H. Wang","doi":"10.1109/ASRU46091.2019.9003966","DOIUrl":"https://doi.org/10.1109/ASRU46091.2019.9003966","url":null,"abstract":"In a spoken multiple-choice question answering (MCQA) task, where passages, questions, and choices are given in the form of speech, usually only the auto-transcribed text is considered in system development. However, the acoustic-level information may contain useful cues for answer prediction, and to the best of our knowledge, only a few studies focus on using acoustic-level information, or fusing it with text-level information, for a spoken MCQA task. Therefore, this paper presents a hierarchical multistage multimodal (HMM) framework based on convolutional neural networks (CNNs) to integrate text- and acoustic-level statistics into neural modeling for spoken MCQA. Specifically, the acoustic-level statistics are expected to offset text inaccuracies caused by automatic speech recognition (ASR) systems or representation inadequacies lurking in word embedding generators, thereby making the spoken MCQA system robust. In the proposed HMM framework, the two modalities are first manipulated to separately derive acoustic- and text-level representations for the passage, question, and choices. Next, these features are jointly involved in inferring the relationships among the passage, question, and choices. Then, a final representation is derived for each choice, which encodes the relationship of the choice to the passage and question. Finally, the most likely answer is determined based on the individual final representations of all choices. Evaluated on the data of “Formosa Grand Challenge - Talk to AI”, a Mandarin Chinese spoken MCQA contest held in 2018, the proposed HMM framework achieves remarkable improvements in accuracy over the text-only baseline.","PeriodicalId":150913,"journal":{"name":"2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"77 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114764956","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Dropout-Based Single Model Committee Approach for Active Learning in ASR","authors":"Jiayi Fu, Kuang Ru","doi":"10.1109/ASRU46091.2019.9003728","DOIUrl":"https://doi.org/10.1109/ASRU46091.2019.9003728","url":null,"abstract":"In this paper, we propose a new committee-based approach for active learning (AL) in automatic speech recognition (ASR). This approach achieves a lower recognition word error rate (WER) with fewer transcriptions by selecting the most informative samples. Different from previous committee-based AL approaches, the committee construction process of this approach needs to train only one acoustic model (AM), with dropout. Since only one model needs to be trained, this approach is simpler and faster. Moreover, as the AM is improved continuously, we found this approach to be more robust to such improvement. In experiments, we compared our approach with random sampling and with another state-of-the-art committee-based approach based on heterogeneous neural networks (HNN). We examined our approach in terms of WER, the time to construct the committee, and robustness to model improvement on a Mandarin ASR task with 1600 hours of speech data. The results show that it achieves 2–3 times the relative WER reduction of random sampling, and that it uses only 75% of the time to achieve a WER close to that of the HNN-based approach.","PeriodicalId":150913,"journal":{"name":"2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125717107","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Transfer Learning for Context-Aware Spoken Language Understanding","authors":"Qian Chen, Zhu Zhuo, Wen Wang, Qiuyun Xu","doi":"10.1109/ASRU46091.2019.9003902","DOIUrl":"https://doi.org/10.1109/ASRU46091.2019.9003902","url":null,"abstract":"Spoken language understanding (SLU) is a key component of task-oriented dialogue systems. SLU parses natural language user utterances into semantic frames. Previous work has shown that incorporating context information significantly improves SLU performance for multi-turn dialogues. However, collecting a large-scale human-labeled multi-turn dialogue corpus for the target domains is complex and costly. To reduce dependency on the collection and annotation effort, we propose a Context Encoding Language Transformer (CELT) model that facilitates exploiting various types of context information for SLU. We explore different transfer learning approaches to reduce dependency on data collection and annotation. In addition to unsupervised pre-training on large-scale general-purpose unlabeled corpora, such as Wikipedia, we explore unsupervised and supervised adaptive training approaches for transfer learning that benefit from other in-domain and out-of-domain dialogue corpora. Experimental results demonstrate that the proposed model with the proposed transfer learning approaches achieves significant improvement in SLU performance over state-of-the-art models on two large-scale single-turn dialogue benchmarks and one large-scale multi-turn dialogue benchmark.","PeriodicalId":150913,"journal":{"name":"2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126833518","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}