2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU): Latest Publications

Training Language Models for Long-Span Cross-Sentence Evaluation
2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Pub Date: 2019-12-01, DOI: 10.1109/ASRU46091.2019.9003788
Kazuki Irie, Albert Zeyer, R. Schlüter, H. Ney
Abstract: While recurrent neural networks can motivate cross-sentence language modeling and its application to automatic speech recognition (ASR), corresponding modifications of the training method for that purpose are rarely discussed. More generally, the impact of the training-sequence construction strategy on language models under different evaluation conditions is typically ignored. In this work, we revisit this basic but fundamental question. We train language models based on long short-term memory recurrent neural networks and Transformers using various types of training sequences and study their robustness with respect to different evaluation modes. Our experiments on the 300h Switchboard and Quaero English datasets show that models trained with back-propagation over sequences consisting of concatenations of multiple sentences, with state carry-over across sequences, outperform those trained with sentence-level training, both in terms of perplexity and word error rate for cross-utterance ASR.
Citations: 38
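One concrete reading of the training recipe above: cut a stream of concatenated sentences into fixed-length segments and back-propagate within each segment, while carrying the recurrent state (detached) across segment boundaries. Below is a minimal PyTorch sketch under that reading; the class and function names are illustrative, not from the paper.

```python
import torch
import torch.nn as nn

class LSTMLM(nn.Module):
    def __init__(self, vocab_size, dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.lstm = nn.LSTM(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab_size)

    def forward(self, tokens, state=None):
        hidden, state = self.lstm(self.embed(tokens), state)
        return self.out(hidden), state

def train_with_carry_over(model, segments, optimizer):
    """Train over consecutive segments cut from a stream of concatenated
    sentences, carrying the LSTM state across segment boundaries. The state
    is detached so back-propagation stays within one segment."""
    loss_fn = nn.CrossEntropyLoss()
    state = None
    for seg in segments:  # seg: (batch, seg_len + 1) token ids
        inputs, targets = seg[:, :-1], seg[:, 1:]
        logits, state = model(inputs, state)
        state = tuple(s.detach() for s in state)  # carry over, cut gradients
        loss = loss_fn(logits.reshape(-1, logits.size(-1)),
                       targets.reshape(-1))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

Resetting `state` to None at every sentence boundary recovers sentence-level training; the abstract reports the carry-over variant as the stronger choice for cross-utterance ASR.
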
Novel Enhanced Teager Energy Based Cepstral Coefficients for Replay Spoof Detection
2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Pub Date: 2019-12-01, DOI: 10.1109/ASRU46091.2019.9003934
R. Acharya, H. Patil, Harsh Kotta
Abstract: A replay attack on voice biometrics refers to the fraudulent attempt made by an imposter to spoof another person's identity by replaying pre-recorded voice samples in front of an Automatic Speaker Verification (ASV) system. To develop countermeasures against replay attacks, this paper proposes a new feature set, namely Enhanced Teager Energy Cepstral Coefficients (ETECC), using the recently introduced concept of signal mass. Results obtained on the ASVspoof 2017 version 2.0 dataset suggest that the proposed feature set performs better than the original Teager Energy Cepstral Coefficients (TECC) because the Enhanced Teager Energy Operator (ETEO) gives a better estimate of the signal's energy than the Teager Energy Operator (TEO). We obtained 53.3% and 51.35% reductions in EER on the development and evaluation datasets, respectively, with respect to the baseline system.
Citations: 4
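The TECC baseline mentioned above is built on the Teager Energy Operator. A short NumPy sketch of the standard discrete TEO follows; the paper's ETEO, derived from the signal-mass concept, refines this estimate and is not reproduced here.

```python
import numpy as np

def teager_energy(x):
    """Discrete Teager Energy Operator: psi[x](n) = x(n)^2 - x(n-1) * x(n+1),
    a running estimate of the signal's instantaneous energy."""
    x = np.asarray(x, dtype=float)
    e = np.empty_like(x)
    e[1:-1] = x[1:-1] ** 2 - x[:-2] * x[2:]
    e[0], e[-1] = e[1], e[-2]  # replicate the boundary samples
    return e
```

In a TECC-style pipeline this operator is typically applied per subband after a filterbank, followed by frame averaging, log compression, and a DCT to produce the cepstral coefficients.
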
Towards Real-Time Mispronunciation Detection in Kids' Speech
2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Pub Date: 2019-12-01, DOI: 10.1109/ASRU46091.2019.9003863
Peter William VanHarn Plantinga, E. Fosler-Lussier
Abstract: Modern mispronunciation detection and diagnosis systems have seen significant gains in accuracy due to the introduction of deep learning. However, these systems have not been evaluated for the ability to run in real time, an important factor in applications that provide rapid feedback. In particular, the state of the art uses bi-directional recurrent networks, where a uni-directional network may be more appropriate. Teacher-student learning is a natural approach for improving a uni-directional model, but when using a CTC objective, this is limited by poor alignment of outputs to evidence. We address this limitation by introducing two loss terms for improving the alignments of our models. One loss is an "alignment loss" term that encourages outputs only when features do not resemble silence. The other loss term uses a uni-directional model as a teacher to align the bi-directional model. Our proposed model then uses these aligned bi-directional models as teacher models. Experiments on the CSLU kids' corpus show that these changes decrease the latency of the outputs and improve the detection rates, with a trade-off between these goals.
Citations: 8
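The abstract does not give the exact form of the "alignment loss", but one plausible reading is a penalty on non-blank CTC posteriors over silence-like frames. A hedged PyTorch sketch of that reading follows; the energy-based silence test and the weighting are assumptions, not the paper's formulation.

```python
import torch

def alignment_loss(log_probs, feats, blank=0, energy_floor=1e-3):
    """Penalize non-blank CTC posteriors on silence-like frames.
    log_probs: (T, B, V) log-posteriors from the acoustic model.
    feats: (T, B, F) input features; low mean energy is treated as silence.
    The energy heuristic and the averaging are illustrative assumptions."""
    energy = feats.pow(2).mean(dim=-1)             # (T, B) crude frame energy
    silence = (energy < energy_floor).float()      # 1.0 on silence-like frames
    non_blank = 1.0 - log_probs[..., blank].exp()  # prob. of emitting a symbol
    return (silence * non_blank).sum() / silence.sum().clamp(min=1.0)
```
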
Domain Adaptation via Teacher-Student Learning for End-to-End Speech Recognition
2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Pub Date: 2019-12-01, DOI: 10.1109/ASRU46091.2019.9003776
Zhong Meng, Jinyu Li, Yashesh Gaur, Y. Gong
Abstract: Teacher-student (T/S) learning has been shown to be effective for domain adaptation of deep neural network acoustic models in hybrid speech recognition systems. In this work, we extend T/S learning to large-scale unsupervised domain adaptation of an attention-based end-to-end (E2E) model through two levels of knowledge transfer: the teacher's token posteriors as soft labels, and its one-best predictions as decoder guidance. To further improve T/S learning with the help of ground-truth labels, we propose adaptive T/S (AT/S) learning. Instead of conditionally choosing either the teacher's soft token posteriors or the one-hot ground-truth label, in AT/S the student always learns from both the teacher and the ground truth, with a pair of adaptive weights assigned to the soft and one-hot labels quantifying the confidence in each knowledge source. The confidence scores are dynamically estimated at each decoder step as a function of the soft and one-hot labels. With 3400 hours of parallel close-talk and far-field Microsoft Cortana data for domain adaptation, T/S and AT/S achieve 6.3% and 10.3% relative word error rate improvements over a strong E2E model trained with the same amount of far-field data.
Citations: 41
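A minimal sketch of the adaptive soft/one-hot interpolation described above, assuming the per-step confidence is taken as the teacher's probability mass on the correct token; the paper estimates the weights as a function of the soft and one-hot labels, so this particular choice is an assumption.

```python
import torch
import torch.nn.functional as F

def adaptive_ts_loss(student_logits, teacher_probs, labels):
    """Adaptive teacher-student (AT/S) sketch: at each decoder step the
    student learns from both the teacher's soft token posteriors and the
    one-hot ground truth, weighted by per-step confidence scores. Here the
    soft weight is the teacher's probability on the correct token and the
    hard weight is its complement (one plausible choice, not the paper's)."""
    log_p = F.log_softmax(student_logits, dim=-1)              # (B, T, V)
    teacher_conf = teacher_probs.gather(
        -1, labels.unsqueeze(-1)).squeeze(-1)                  # (B, T)
    w_soft, w_hard = teacher_conf, 1.0 - teacher_conf
    soft_loss = -(teacher_probs * log_p).sum(-1)               # CE to soft labels
    hard_loss = F.nll_loss(log_p.transpose(1, 2), labels,
                           reduction="none")                   # CE to one-hot
    return (w_soft * soft_loss + w_hard * hard_loss).mean()
```
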
Learning Between Different Teacher and Student Models in ASR
2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Pub Date: 2019-12-01, DOI: 10.1109/ASRU46091.2019.9003756
J. H. M. Wong, M. Gales, Yu Wang
Abstract: Teacher-student learning can be applied in automatic speech recognition for model compression and domain adaptation. It trains a student model to emulate the behaviour of a teacher model, and only the student is used to perform recognition. Depending on the application, the teacher and student may differ in their model types, complexities, input contexts, and input features. Previous work has often shown that learning from a strong teacher allows the student to perform better than an equivalent model trained with only the reference transcriptions. However, there has not been much investigation into whether a particular form of teacher is appropriate for a given student to learn from. This paper studies how effectively the student is able to learn from the teacher when differences exist between their designs. The Augmented Multi-party Interaction (AMI) meeting transcription and Multi-Genre Broadcast (MGB-3) television broadcast audio tasks are used in this analysis. Experimental results suggest that a student can effectively learn from a more complex teacher, but may struggle when it lacks input information. It is therefore important to carefully consider the design of the student for each application.
Citations: 2
Markov Recurrent Neural Network Language Model
2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Pub Date: 2019-12-01, DOI: 10.1109/ASRU46091.2019.9003850
Jen-Tzung Chien, Che-Yu Kuo
Abstract: Recurrent neural networks (RNNs) have achieved great success in language modeling, where temporal information based on a deterministic state is continuously extracted and evolved through time. Such a simple deterministic transition function using input-to-hidden and hidden-to-hidden weights is usually insufficient to reflect the diversities and variations of the latent variable structure behind heterogeneous natural language. This paper presents a new stochastic Markov RNN (MRNN) to strengthen the learning capability of the language model, where the trajectory of word sequences is driven by a neural Markov process with Markov state transitions based on a K-state long short-term memory model. A latent state machine is constructed to characterize the complicated semantics in structured lexical patterns. Gumbel-softmax is introduced to implement the stochastic backpropagation algorithm with discrete states. Parallel computation for rapid realization of the MRNN is presented, and a variational Bayesian learning procedure is implemented. Experiments demonstrate the merits of the stochastic and diverse representation of the MRNN language model, where the overhead in parameters and computation is limited.
Citations: 7
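A sketch of the K-state idea: K candidate LSTM transitions with a discrete state drawn per step via Gumbel-softmax, which keeps the stochastic choice differentiable for backpropagation. The sizes and the transition parameterization below are illustrative, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MarkovRNNCell(nn.Module):
    """K candidate LSTM transitions; a discrete state is sampled per step
    with the straight-through Gumbel-softmax estimator (hard=True), so the
    forward pass uses a one-hot choice while gradients flow through the
    relaxed sample."""
    def __init__(self, input_dim, hidden_dim, K=4, tau=1.0):
        super().__init__()
        self.cells = nn.ModuleList(
            [nn.LSTMCell(input_dim, hidden_dim) for _ in range(K)])
        self.state_logits = nn.Linear(input_dim + hidden_dim, K)
        self.tau = tau

    def forward(self, x, hc):
        h, c = hc
        logits = self.state_logits(torch.cat([x, h], dim=-1))     # (B, K)
        z = F.gumbel_softmax(logits, tau=self.tau, hard=True)     # one-hot
        # run all K transitions in parallel, then select with the sample
        hs, cs = zip(*(cell(x, (h, c)) for cell in self.cells))
        h_new = (z.unsqueeze(-1) * torch.stack(hs, dim=1)).sum(dim=1)
        c_new = (z.unsqueeze(-1) * torch.stack(cs, dim=1)).sum(dim=1)
        return h_new, c_new
```

Running all K transitions and blending with the one-hot sample is one way to realize the parallel computation the abstract mentions.
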
Optimizing Neural Network Embeddings Using a Pair-Wise Loss for Text-Independent Speaker Verification
2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Pub Date: 2019-12-01, DOI: 10.1109/ASRU46091.2019.9003794
Hira Dhamyal, Tianyan Zhou, B. Raj, Rita Singh
Abstract: This paper proposes a new loss function, called the "quartet" loss, for better optimization of neural networks for matching tasks. For such tasks, where neural network embeddings are the key component, optimizing the network for better embeddings is critical. The embeddings are required to be class discriminative, exhibiting minimal intra-class variation and maximal inter-class variation even for unseen classes, for better generalization of the network. The quartet loss explicitly computes the distance metric between pairs of inputs and increases the gap between the similarity-score distributions of same-class pairs and different-class pairs. We evaluate on the speaker verification task and demonstrate the performance of the loss on our proposed neural network.
Citations: 6
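The quartet loss itself is defined over pairs of pairs; as a hedged approximation of the stated goal, the sketch below widens the gap between same-class and different-class similarity-score distributions, using cosine similarity and a hinge on the distribution means, both of which are assumptions.

```python
import torch
import torch.nn.functional as F

def pairwise_gap_loss(emb_a, emb_b, same_class, margin=0.5):
    """Widen the gap between the similarity-score distributions of
    same-class and different-class embedding pairs.
    emb_a, emb_b: (N, D) embeddings of the two utterances in each pair.
    same_class: (N,) bool tensor; assumes both pair types occur in the batch.
    This is an illustrative stand-in for the quartet loss, not its
    published definition."""
    scores = F.cosine_similarity(emb_a, emb_b, dim=-1)  # (N,)
    pos = scores[same_class]       # same-speaker pair scores
    neg = scores[~same_class]      # different-speaker pair scores
    return F.relu(margin - (pos.mean() - neg.mean()))
```
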
Spoken Multiple-Choice Question Answering Using Multimodal Convolutional Neural Networks
2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Pub Date: 2019-12-01, DOI: 10.1109/ASRU46091.2019.9003966
Shang-Bao Luo, Hung-Shin Lee, Kuan-Yu Chen, H. Wang
Abstract: In a spoken multiple-choice question answering (MCQA) task, where passages, questions, and choices are given in the form of speech, usually only the auto-transcribed text is considered in system development. Yet the acoustic-level information may contain useful cues for answer prediction. To the best of our knowledge, only a few studies focus on using acoustic-level information, or fusing it with text-level information, for a spoken MCQA task. This paper therefore presents a hierarchical multistage multimodal (HMM) framework based on convolutional neural networks (CNNs) to integrate text- and acoustic-level statistics into neural modeling for spoken MCQA. Specifically, the acoustic-level statistics are expected to offset text inaccuracies caused by automatic speech recognition (ASR) systems, or representation inadequacies lurking in word embedding generators, thereby making the spoken MCQA system robust. In the proposed HMM framework, the two modalities are first manipulated to separately derive acoustic- and text-level representations for the passage, question, and choices. Next, these features are jointly used to infer the relationships among the passage, question, and choices. A final representation is then derived for each choice, encoding the relationship of the choice to the passage and question. Finally, the most likely answer is determined based on the individual final representations of all choices. Evaluated on the data of "Formosa Grand Challenge - Talk to AI", a Mandarin Chinese spoken MCQA contest held in 2018, the proposed HMM framework achieves remarkable improvements in accuracy over the text-only baseline.
Citations: 3
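A minimal sketch of the two-modality encoding step: separate 1-D CNNs over the word-embedding sequence and the frame-level acoustic statistics, fused by concatenation. The paper's HMM framework stacks several such stages hierarchically; the filter sizes and the flat concatenation here are assumptions.

```python
import torch
import torch.nn as nn

class TwoModalEncoder(nn.Module):
    """Encode the word-embedding sequence and the acoustic feature sequence
    of the same passage with separate 1-D CNNs, max-pool over time, and
    fuse by concatenation. Illustrative dimensions, not the paper's."""
    def __init__(self, text_dim=300, audio_dim=40, channels=128):
        super().__init__()
        self.text_cnn = nn.Conv1d(text_dim, channels, kernel_size=3, padding=1)
        self.audio_cnn = nn.Conv1d(audio_dim, channels, kernel_size=5, padding=2)

    def forward(self, text_seq, audio_seq):
        # text_seq: (B, T_words, text_dim); audio_seq: (B, T_frames, audio_dim)
        t = self.text_cnn(text_seq.transpose(1, 2)).relu().max(dim=-1).values
        a = self.audio_cnn(audio_seq.transpose(1, 2)).relu().max(dim=-1).values
        return torch.cat([t, a], dim=-1)  # (B, 2 * channels) fused vector
```
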
A Dropout-Based Single Model Committee Approach for Active Learning in ASR
2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Pub Date: 2019-12-01, DOI: 10.1109/ASRU46091.2019.9003728
Jiayi Fu, Kuang Ru
Abstract: In this paper, we propose a new committee-based approach to active learning (AL) in automatic speech recognition (ASR). The approach can achieve a lower word error rate (WER) with fewer transcriptions by selecting the most informative samples. Unlike previous committee-based AL approaches, its committee construction process requires training only one acoustic model (AM), with dropout. Since only one model needs to be trained, the approach is simpler and faster; and because the AM improves continuously, we also found it to be more robust to such improvement. In experiments, we compared our approach with random sampling and with a state-of-the-art committee-based approach based on heterogeneous neural networks (HNN). We examined WER, the time to construct the committee, and robustness to model improvement on a Mandarin ASR task with 1600 hours of speech data. The results showed that our approach achieves 2-3 times the relative WER reduction of random sampling, and needs only 75% of the time of the HNN-based approach to reach a comparable WER.
Citations: 0
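A sketch of the dropout-based committee: keep dropout active at inference and decode each utterance several times, treating each stochastic pass as a committee member and scoring utterances by member disagreement. Vote entropy is used below as the disagreement measure, and the model is assumed to map one utterance's features to per-frame logits of shape (T, V); both are assumptions rather than the paper's exact setup.

```python
import torch

@torch.no_grad()
def committee_disagreement(model, feats, n_members=8):
    """Score one utterance by the disagreement among n_members stochastic
    forward passes of a single dropout-trained model. Higher scores mark
    more informative utterances to send for transcription."""
    model.train()  # leave dropout switched on at inference time
    votes = torch.stack(
        [model(feats).argmax(dim=-1) for _ in range(n_members)])  # (M, T)
    # per-frame vote entropy over the committee, averaged over frames
    counts = torch.stack(
        [(votes == k).float().mean(dim=0) for k in votes.unique()])  # (K, T)
    entropy = -(counts.clamp(min=1e-8).log() * counts).sum(dim=0)    # (T,)
    return entropy.mean().item()
```
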
Transfer Learning for Context-Aware Spoken Language Understanding
2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Pub Date: 2019-12-01, DOI: 10.1109/ASRU46091.2019.9003902
Qian Chen, Zhu Zhuo, Wen Wang, Qiuyun Xu
Abstract: Spoken language understanding (SLU) is a key component of task-oriented dialogue systems. SLU parses natural-language user utterances into semantic frames. Previous work has shown that incorporating context information significantly improves SLU performance for multi-turn dialogues. However, collecting a large-scale human-labeled multi-turn dialogue corpus for the target domains is complex and costly. To reduce dependency on the collection and annotation effort, we propose a Context Encoding Language Transformer (CELT) model that facilitates exploiting various kinds of context information for SLU. We explore different transfer learning approaches to reduce dependency on data collection and annotation. In addition to unsupervised pre-training using large-scale general-purpose unlabeled corpora such as Wikipedia, we explore unsupervised and supervised adaptive training approaches to benefit from other in-domain and out-of-domain dialogue corpora. Experimental results demonstrate that the proposed model with the proposed transfer learning approaches achieves significant improvements in SLU performance over state-of-the-art models on two large-scale single-turn dialogue benchmarks and one large-scale multi-turn dialogue benchmark.
Citations: 5