2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU): Latest Publications

Leveraging Language ID in Multilingual End-to-End Speech Recognition
2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) | Pub Date: 2019-12-01 | DOI: 10.1109/ASRU46091.2019.9003870
Austin Waters, Neeraj Gaur, Parisa Haghani, P. Moreno, Zhongdi Qu
Abstract: Recent advances in end-to-end speech recognition have made it possible to build multilingual models capable of recognizing speech in multiple languages. Multilingual models can outperform their monolingual counterparts, depending on the amount of training data and the relatedness of the languages. However, in some cases these models rely on having perfect knowledge of the language being spoken; that is, they expect to be provided with an external language ID that augments the input features or modulates internal layers of the network. In this paper, we introduce a novel technique for inferring the language ID in a streaming fashion using RNN-T, along with a novel loss function that pressures the model to identify the language after as few frames as possible. The output of this streaming language-ID model is used in training and inference of a multilingual recognition model. We show the effectiveness of our approach through experiments on two sets of languages, one consisting of different dialects of Arabic and the other consisting of Nordic languages, Finnish and Dutch.
Citations: 23
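The early-identification pressure described in this abstract lends itself to a frame-weighted classification loss. Below is a minimal sketch assuming a unidirectional LSTM language-ID head and an exponentially decaying frame weight; the paper's exact loss and its integration with RNN-T are not given here, so `StreamingLID`, `early_id_loss`, and `decay` are illustrative choices, not the authors' implementation.

```python
# A sketch of a streaming language-ID head with a loss that encourages
# early identification. The weighting scheme is an assumption; the
# paper's exact loss formulation may differ.
import torch
import torch.nn as nn
import torch.nn.functional as F

class StreamingLID(nn.Module):
    def __init__(self, feat_dim: int, num_langs: int, hidden: int = 256):
        super().__init__()
        # A unidirectional LSTM keeps the model streamable: the
        # prediction at frame t depends only on frames 1..t.
        self.rnn = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_langs)

    def forward(self, feats):                      # feats: (B, T, feat_dim)
        out, _ = self.rnn(feats)
        return self.head(out)                      # logits: (B, T, num_langs)

def early_id_loss(logits, lang_ids, decay: float = 0.05):
    """Per-frame cross-entropy, weighted so that errors on early frames
    cost more; this pressures the model to commit to a language ID
    after as few frames as possible."""
    B, T, L = logits.shape
    targets = lang_ids.unsqueeze(1).expand(B, T)   # same label at every frame
    ce = F.cross_entropy(logits.reshape(B * T, L),
                         targets.reshape(B * T),
                         reduction="none").view(B, T)
    weights = torch.exp(-decay * torch.arange(T, device=ce.device,
                                              dtype=ce.dtype))
    return (ce * weights).sum() / weights.sum() / B

model = StreamingLID(feat_dim=80, num_langs=4)
logits = model(torch.randn(2, 100, 80))            # two 1-second utterances
loss = early_id_loss(logits, torch.tensor([0, 3]))
loss.backward()
```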
Speech Reveals Future Risk of Developing Dementia: Predictive Dementia Screening from Biographic Interviews
2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) | Pub Date: 2019-12-01 | DOI: 10.1109/ASRU46091.2019.9003908
Jochen Weiner, C. Frankenberg, J. Schröder, Tanja Schultz
Abstract: Alzheimer's disease is a progressive, incurable condition for which the success of any symptomatic therapy depends crucially on its starting time; ideally, therapy starts before the disease has caused any cognitive impairments. Our work aims at developing speech-based dementia screening methods that detect dementia as early as possible. Here, we aim to predict the onset even before clinical screening tests can diagnose the disease. Using the longitudinal ILSE study, we automatically extract features from biographic interviews and predict the development of dementia 5 and 12 years into the future. Our prediction system achieves 73.3% and 75.7% unweighted average recall (UAR), respectively, clearly outperforming predictions based on prior diagnoses or disease prevalence. Thus, the automated analysis of spoken interviews offers a highly effective prediction procedure that allows for easy-to-use, cost-effective casual testing.
Citations: 13
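UAR, the metric reported above, is simply macro-averaged recall: the mean of per-class recalls, so a majority-class predictor scores 0.5 on a two-class task regardless of prevalence. A small sketch using scikit-learn (the labels are illustrative, not ILSE data):

```python
# UAR (unweighted average recall): sklearn's macro-averaged recall is
# exactly this metric. Labels below are made up for illustration.
from sklearn.metrics import recall_score

y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]   # 1 = develops dementia later
y_pred = [0, 0, 0, 1, 0, 0, 1, 1, 0, 1]

uar = recall_score(y_true, y_pred, average="macro")
# Prevalence baseline: always predicting the majority class gives
# recall 1.0 on one class and 0.0 on the other, i.e. UAR = 0.5.
print(f"UAR = {uar:.3f} (majority-class baseline = 0.500)")
```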
Long Range Acoustic and Deep Features Perspective on ASVspoof 2019
2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) | Pub Date: 2019-12-01 | DOI: 10.1109/ASRU46091.2019.9003845
Rohan Kumar Das, Jichen Yang, Haizhou Li
Abstract: To secure automatic speaker verification (ASV) systems from intruders, robust countermeasures for spoofing attack detection are required. The ASVspoof challenge series provides a shared anti-spoofing task. The most recent edition, ASVspoof 2019, focuses on attacks by both synthetic and replay speech, referred to as logical and physical access attacks, respectively. In our ASVspoof 2019 submission, we considered novel countermeasures based on long range acoustic features, which are unique in many ways as they are derived using the octave power spectrum and subbands, as opposed to the commonly used linear power spectrum. In the post-challenge study, we further investigate the use of deep features that enhance the discriminative ability between genuine and spoofed speech. In this paper, we summarize the findings from the perspective of long range acoustic and deep features for spoof detection, and present a comprehensive analysis of the nature of different kinds of spoofing attacks and system development.
Citations: 51
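As a rough illustration of octave-based subband features, the sketch below pools a linear FFT power spectrum into octave-spaced bands before taking logs, in contrast to linearly spaced bands. The band edges, frame length, and 62.5 Hz base frequency are assumptions, not the paper's exact front end.

```python
# A minimal sketch of octave-band log energies: the power spectrum is
# pooled over octave-spaced subbands rather than linear bands.
import numpy as np

def octave_band_energies(frame, sr=16000, f_low=62.5):
    spec = np.abs(np.fft.rfft(frame)) ** 2          # linear power spectrum
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    edges, f = [], f_low
    while f < sr / 2:                               # 62.5, 125, 250, ... Hz
        edges.append(f)
        f *= 2.0
    edges.append(sr / 2)
    bands = []
    for lo, hi in zip(edges[:-1], edges[1:]):       # bins below f_low discarded
        mask = (freqs >= lo) & (freqs < hi)
        bands.append(spec[mask].sum())
    return np.log(np.asarray(bands) + 1e-10)        # log energy per octave

frame = np.random.randn(512)                        # one 32 ms frame at 16 kHz
print(octave_band_energies(frame))                  # 7 octave-band energies
```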
Personalization of End-to-End Speech Recognition on Mobile Devices for Named Entities
2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) | Pub Date: 2019-12-01 | DOI: 10.1109/ASRU46091.2019.9003775
K. Sim, F. Beaufays, Arnaud Benard, Dhruv Guliani, Andreas Kabel, Nikhil Khare, Tamar Lucassen, P. Zadražil, Harry Zhang, Leif T. Johnson, Giovanni Motta, Lillian Zhou
Abstract: We study the effectiveness of several techniques to personalize end-to-end speech models and improve the recognition of proper names relevant to the user. These techniques differ in the amount of user effort required to provide supervision, and are evaluated on how they impact speech recognition performance. We propose keyword-dependent precision and recall metrics to measure vocabulary acquisition performance. We evaluate the algorithms on a dataset that we designed to contain names of persons that are difficult to recognize; the baseline recall rate for proper names in this dataset is therefore very low, at 2.4%. A data synthesis approach we developed brings it to 48.6%, with no need for speech input from the user. With speech input, if the user corrects only the names, the name recall rate improves to 64.4%; if the user corrects all the recognition errors, we achieve the best recall of 73.5%. To eliminate the need to upload user data and store personalized models on a server, we focus on performing the entire personalization workflow on a mobile device.
Citations: 53
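The keyword-dependent precision and recall proposed above can be computed by counting, per keyword, occurrences in reference versus hypothesis transcripts. A minimal sketch (the counting convention and example data are assumptions, not the paper's evaluation code):

```python
# Keyword-dependent precision/recall for vocabulary acquisition:
# per keyword, compare hypothesized vs. actual occurrence counts.
from collections import Counter

def keyword_prf(refs, hyps, keywords):
    stats = {}
    for kw in keywords:
        tp = fp = fn = 0
        for ref, hyp in zip(refs, hyps):
            r = Counter(ref.split())[kw]            # occurrences in reference
            h = Counter(hyp.split())[kw]            # occurrences in hypothesis
            tp += min(r, h)
            fp += max(h - r, 0)
            fn += max(r - h, 0)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        stats[kw] = (prec, rec)
    return stats

refs = ["call zadrazil now", "text zadrazil hello"]
hyps = ["call zadrasil now", "text zadrazil hello"]
print(keyword_prf(refs, hyps, ["zadrazil"]))        # {'zadrazil': (1.0, 0.5)}
```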
Highly Efficient Neural Network Language Model Compression Using Soft Binarization Training
2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) | Pub Date: 2019-12-01 | DOI: 10.1109/ASRU46091.2019.9003744
Rao Ma, Qi Liu, Kai Yu
Abstract: The long short-term memory language model (LSTM LM) has been widely investigated for the large vocabulary continuous speech recognition (LVCSR) task. Despite the excellent performance of the LSTM LM, its use in resource-constrained environments, such as portable devices, is limited by its high memory consumption. Binarized language models have been proposed to achieve significant memory reduction, at the cost of performance degradation at high compression ratios. In this paper, we propose a soft binarization approach to recover the performance of the binarized LSTM LM. Experiments show that the proposed method can achieve a high compression rate of 30× with almost no performance loss in both language modeling and speech recognition tasks.
Citations: 5
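One common way to realize soft binarization is to pass weights through a tanh whose temperature is annealed during training, so the network gradually approaches its hard-binarized form; at deployment each matrix is stored as sign bits plus a single scale. The sketch below shows this idea; the annealing schedule and layer design are assumptions, not the paper's exact method.

```python
# A minimal sketch of soft binarization training for a linear layer.
import torch
import torch.nn as nn

class SoftBinaryLinear(nn.Module):
    def __init__(self, d_in, d_out):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(d_out, d_in) * 0.1)
        self.alpha = 1.0                      # annealed upward during training

    def forward(self, x):
        if self.training:
            # Soft, differentiable surrogate: tanh approaches sign as
            # alpha grows, easing the network into binary weights.
            w = torch.tanh(self.alpha * self.weight)
        else:
            # Hard binarization at inference: sign bits plus one scale
            # per matrix, i.e. roughly 32x less memory than fp32.
            w = torch.sign(self.weight) * self.weight.abs().mean()
        return x @ w.t()

layer = SoftBinaryLinear(256, 256)
layer.alpha = 5.0                             # later training stage: sharper tanh
y_train = layer(torch.randn(4, 256))          # soft path
layer.eval()
y_infer = layer(torch.randn(4, 256))          # binarized path
```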
Exploring Model Units and Training Strategies for End-to-End Speech Recognition
2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) | Pub Date: 2019-12-01 | DOI: 10.1109/ASRU46091.2019.9003834
Mingkun Huang, Yizhou Lu, Lan Wang, Y. Qian, Kai Yu
Abstract: In this work, we explore end-to-end speech recognition models (CTC, RNN-Transducer and attention-based models) with different model units (character, wordpiece and word) and various training strategies. We show that wordpiece units outperform character units for all end-to-end systems on the Switchboard Hub5'00 benchmark. To improve the performance of end-to-end systems, we propose a multi-stage pretraining strategy, which gives 25.0% and 18.0% relative improvements over training from scratch for attention and RNN-T models, respectively, with wordpiece units. We achieve state-of-the-art performance on the Switchboard+Fisher-2000h task, outperforming all prior work. Together with other training strategies such as label smoothing and data augmentation, we achieve 5.9%/12.1% WER on the Switchboard/CallHome test sets without using any external language model. This is a new performance milestone for a single end-to-end system, and it is also much better than the previously published best hybrid system, which achieves 6.7%/12.5% on the same sets.
Citations: 8
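Wordpiece inventories of the kind compared above are typically trained with a subword tokenizer such as SentencePiece. A minimal sketch follows; the vocabulary size, model type, and file paths are illustrative, not the paper's configuration, and `train_transcripts.txt` is an assumed input file of one transcript per line.

```python
# A sketch of building wordpiece units with SentencePiece.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="train_transcripts.txt",    # assumed path: one transcript per line
    model_prefix="swbd_wp",
    vocab_size=1000,                  # a typical wordpiece inventory for E2E ASR
    model_type="unigram",
)

sp = spm.SentencePieceProcessor(model_file="swbd_wp.model")
print(sp.encode("speech recognition is fun", out_type=str))
# e.g. ['▁speech', '▁recog', 'ni', 'tion', '▁is', '▁fun']
```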
On the Study of Generative Adversarial Networks for Cross-Lingual Voice Conversion
2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) | Pub Date: 2019-12-01 | DOI: 10.1109/ASRU46091.2019.9003939
Berrak Sisman, Mingyang Zhang, M. Dong, Haizhou Li
Abstract: Cross-lingual voice conversion (VC) aims to convert the source speaker's voice to sound like that of the target speaker when the source and target speakers speak different languages. In this paper, we propose to use Generative Adversarial Networks (GANs) for cross-lingual voice conversion. We build on the Variational Autoencoding Wasserstein GAN (VAW-GAN) and the cycle-consistent adversarial network (CycleGAN), which are known to be effective for mono-lingual voice conversion. As cross-lingual voice conversion must convert voices across different phonetic systems, it is more challenging than mono-lingual voice conversion. Using VAW-GAN and CycleGAN, we successfully convert the speaker identity while carrying over the source speaker's linguistic content. The proposed idea is unique in that it relies neither on bilingual data and their alignment nor on any external process such as ASR. Moreover, it works with a limited amount of training data of any two languages. To the best of our knowledge, this is the first comprehensive study of Generative Adversarial Networks for cross-lingual voice conversion. In the experiments, we achieve high-quality converted voices that perform as well as or better than mono-lingual voice conversion.
Citations: 25
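The reason CycleGAN needs no parallel or bilingual data is its cycle-consistency term: mapping source features to the target speaker and back must reconstruct the input, which preserves linguistic content. A minimal sketch of the generator-side losses, assuming LSGAN-style adversarial terms and toy linear generators over spectral features (all layer choices here are illustrative placeholders, not the paper's networks):

```python
# A sketch of CycleGAN generator losses for voice conversion.
import torch
import torch.nn as nn
import torch.nn.functional as F

dim = 24                                    # e.g. 24 MCEP coefficients per frame
G, Fn = nn.Linear(dim, dim), nn.Linear(dim, dim)   # x->y and y->x generators
D_x, D_y = nn.Linear(dim, 1), nn.Linear(dim, 1)    # per-frame discriminators

def cyclegan_generator_loss(x, y, lam=10.0):
    fake_y, fake_x = G(x), Fn(y)
    # LSGAN-style adversarial terms: each generator tries to make its
    # output look real to the corresponding discriminator.
    adv = F.mse_loss(D_y(fake_y), torch.ones_like(D_y(fake_y))) \
        + F.mse_loss(D_x(fake_x), torch.ones_like(D_x(fake_x)))
    # Cycle consistency: x -> y' -> x'' must reconstruct x; this is
    # what carries the source speaker's linguistic content across
    # without parallel, bilingual, or ASR supervision.
    cyc = F.l1_loss(Fn(fake_y), x) + F.l1_loss(G(fake_x), y)
    return adv + lam * cyc

loss = cyclegan_generator_loss(torch.randn(32, dim), torch.randn(32, dim))
loss.backward()
```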
Mixed Bandwidth Acoustic Modeling Leveraging Knowledge Distillation
2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) | Pub Date: 2019-12-01 | DOI: 10.1109/ASRU46091.2019.9003760
Takashi Fukuda, Samuel Thomas
Abstract: Training of mixed bandwidth acoustic models has recently been realized by incorporating special Mel filterbanks. To fit information into every filterbank bin available across both narrowband and wideband data, these filterbanks pad zeros at the high-frequency ranges of narrowband data. Although these methods succeed in decreasing word error rates (WER) on broadband data, they fail to improve on narrowband signals. In this paper, we propose methods to mitigate these effects with generalized knowledge distillation. In our method, specialized teacher networks are first trained on lossless acoustic features with full-scale Mel filterbanks. While training student networks, privileged knowledge from these teacher networks is then used to compensate for the information missing at high frequencies introduced by the special Mel filterbanks. We show the benefit of the proposed technique over traditional methods for both narrowband (10% relative WER improvement) and wideband data (7.5% relative WER improvement) on the Aurora 4 task.
Citations: 4
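Generalized knowledge distillation of this kind is usually implemented as a weighted sum of the hard-label cross-entropy and a temperature-smoothed KL term against the teacher's outputs. A minimal sketch, assuming typical temperature and mixing values rather than the paper's settings:

```python
# A sketch of a distillation loss: the student (narrowband input) also
# matches soft targets from a teacher trained on full-bandwidth features.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets,
                      T=2.0, alpha=0.5):
    hard = F.cross_entropy(student_logits, targets)
    # KL between temperature-smoothed distributions; the T*T factor
    # keeps gradient magnitudes comparable across temperatures.
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * T * T
    return alpha * hard + (1.0 - alpha) * soft

s = torch.randn(8, 3000, requires_grad=True)   # student senone logits
t = torch.randn(8, 3000)                       # teacher logits (privileged)
y = torch.randint(0, 3000, (8,))               # frame-level senone targets
distillation_loss(s, t, y).backward()
```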
Small-Footprint Keyword Spotting with Graph Convolutional Network
2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) | Pub Date: 2019-12-01 | DOI: 10.1109/ASRU46091.2019.9004005
Xi Chen, S. Yin, Dandan Song, P. Ouyang, Leibo Liu, Shaojun Wei
Abstract: Despite the recent successes of deep neural networks, it remains challenging to achieve high-precision keyword spotting (KWS) on resource-constrained devices. In this study, we propose a novel context-aware and compact architecture for the keyword spotting task. Based on residual connections and a bottleneck structure, we design a compact and efficient network for KWS. To leverage the long-range dependencies and global context of the convolutional feature maps, a graph convolutional network is introduced to encode non-local relations. Evaluated on the Google Speech Commands Dataset, the proposed method achieves state-of-the-art performance and outperforms prior work by a large margin at lower computational cost.
Citations: 18
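One way to graft a graph convolution onto convolutional feature maps is to treat each time-frequency position as a graph node and build the adjacency from pairwise feature similarity, so every position can attend to every other one. The sketch below shows this pattern; the similarity-based adjacency and the residual wiring are assumptions, not necessarily the paper's construction.

```python
# A minimal sketch of a graph convolution over feature-map positions
# to capture non-local context in a KWS network.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureMapGCN(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.proj = nn.Linear(channels, channels)

    def forward(self, fmap):                       # fmap: (B, C, H, W)
        B, C, H, W = fmap.shape
        nodes = fmap.flatten(2).transpose(1, 2)    # (B, H*W, C): one node per position
        # Row-normalized adjacency from pairwise similarity: every
        # time-frequency position is connected to every other one.
        adj = F.softmax(nodes @ nodes.transpose(1, 2) / C ** 0.5, dim=-1)
        out = F.relu(self.proj(adj @ nodes))       # one graph-convolution step
        return fmap + out.transpose(1, 2).reshape(B, C, H, W)  # residual

gcn = FeatureMapGCN(32)
y = gcn(torch.randn(4, 32, 10, 8))                # e.g. pooled spectral feature maps
```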
Incremental Lattice Determinization for WFST Decoders
2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) | Pub Date: 2019-12-01 | DOI: 10.1109/ASRU46091.2019.9004006
Zhehuai Chen, M. Yarmohammadi, Hainan Xu, Hang Lv, Lei Xie, Daniel Povey, S. Khudanpur
Abstract: We introduce a lattice determinization algorithm that can operate incrementally. That is, a word-level lattice can be generated for a partial utterance, and then, once more audio has been processed, a word-level lattice for the extended utterance can be obtained without redoing all the work of lattice determinization. This is relevant for ASR decoders such as those used in Kaldi, which first generate a state-level lattice and then convert it to a word-level lattice using a determinization algorithm in a special semiring. Our incremental determinization algorithm is useful when word-level lattices are needed before the end of the utterance, and it also reduces the latency due to determinization at the end of the utterance.
Citations: 0
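A toy way to see the incremental idea is subset construction over a growing acceptor: previously determinized prefixes are reused, and only frontier subsets are expanded when new arcs arrive. The sketch below is a deliberately simplified illustration; real WFST lattice determinization also tracks weights and output strings in a special semiring, and the frontier handling here assumes new arcs attach only to states not yet expanded.

```python
# A toy sketch of incremental determinization by subset construction.
from collections import defaultdict

class IncrementalDeterminizer:
    def __init__(self, start_state=0):
        self.arcs = defaultdict(list)          # nfa_state -> [(label, dest)]
        self.dfa = {}                          # frozenset -> {label: frozenset}
        start = frozenset([start_state])
        self.dfa[start] = {}
        self.frontier = {start}                # subsets awaiting more audio

    def add_arcs(self, new_arcs):
        """Append arcs for newly decoded audio, then expand only subsets
        reachable from the current frontier, reusing earlier work."""
        for src, label, dst in new_arcs:
            self.arcs[src].append((label, dst))
        stack, self.frontier = list(self.frontier), set()
        while stack:
            subset = stack.pop()
            moves = defaultdict(set)
            for s in subset:
                for label, dst in self.arcs[s]:
                    moves[label].add(dst)
            if not moves:
                self.frontier.add(subset)      # expand later, when more arrives
                continue
            for label, dests in moves.items():
                nxt = frozenset(dests)
                self.dfa[subset][label] = nxt
                if nxt not in self.dfa:
                    self.dfa[nxt] = {}
                    stack.append(nxt)

det = IncrementalDeterminizer()
det.add_arcs([(0, "a", 1), (0, "a", 2)])       # partial utterance
det.add_arcs([(1, "b", 3), (2, "b", 4)])       # more audio arrives later
print(det.dfa[frozenset({1, 2})])              # {'b': frozenset({3, 4})}
```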