{"title":"Reducing the computational complexity for whole word models","authors":"H. Soltau, H. Liao, H. Sak","doi":"10.1109/ASRU.2017.8268917","DOIUrl":"https://doi.org/10.1109/ASRU.2017.8268917","url":null,"abstract":"In a previous study, we demonstrated the feasibility to build a competitive, greatly simplified, large vocabulary continuous speech recognition system with whole words as acoustic units. In that system, we model about 100,000 words directly using deep bi-directional LSTM RNNs. To alleviate the data sparsity problem for word models, we train the model on 125,000 hours of semi-supervised acoustic training data. The resulting model works very well as an end-to-end all-neural speech recognition model without the use of any language model removing the need to decode. However, the very large output layer increases the computational cost substantially. In this work we address this issue by adding TDNN (Time Delay Neural Network) layers that reduce the frame rate to 120ms for the output layer. The TDNN layers are interspersed with the LSTM layers, gradually reducing the frame rate from 10ms to 120ms. The new model reduces the computational cost by 60% while improving the word error rate by 6% relative. Compared to a traditional LVCSR system, the whole word speech recognizer uses about the same CPU cycles and can easily be parallelized across CPU cores or run on GPUs.","PeriodicalId":290868,"journal":{"name":"2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125144702","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Perceptual quality and modeling accuracy of excitation parameters in DLSTM-based speech synthesis systems","authors":"Eunwoo Song, F. Soong, Hong-Goo Kang","doi":"10.1109/ASRU.2017.8269001","DOIUrl":"https://doi.org/10.1109/ASRU.2017.8269001","url":null,"abstract":"This paper investigates how the perceptual quality of the synthesized speech is affected by reconstruction errors in excitation signals generated by a deep learning-based statistical model. In this framework, the excitation signal obtained by an LPC inverse filter is first decomposed into harmonic and noise components using an improved time-frequency trajectory excitation (ITFTE) scheme, then they are trained and generated by a deep long short-term memory (DLSTM)-based speech synthesis system. By controlling the parametric dimension of the ITFTE vocoder, we analyze the impact of the harmonic and noise components to the perceptual quality of the synthesized speech. Both objective and subjective experimental results confirm that the maximum perceptually allowable spectral distortion for the harmonic spectrum of the generated excitation is ∼0.08 dB. On the other hand, the absolute spectral distortion in the noise components is meaningless, and only the spectral envelope is relevant to the perceptual quality.","PeriodicalId":290868,"journal":{"name":"2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"608 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123335177","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Binaural processing for robust recognition of degraded speech","authors":"Anjali Menon, Chanwoo Kim, Umpei Kurokawa, R. Stern","doi":"10.1109/ASRU.2017.8268912","DOIUrl":"https://doi.org/10.1109/ASRU.2017.8268912","url":null,"abstract":"This paper discusses a new combination of techniques that help in improving the accuracy of speech recognition in adverse conditions using two microphones. Classic approaches toward binaural speech processing use some form of cross-correlation over time across the two sensors to effectively isolate target speech from interferers. Several additional techniques using temporal and spatial masking have been proposed in the past to improve recognition accuracy in the presence of reverberation and interfering talkers. In this paper, we consider the use of cross-correlation across frequency over some limited range of frequency channels in addition to the existing methods of monaural and binaural processing. This has the effect of locating and reinforcing coincident peaks across frequency over the representation of binaural interaction and provides local smoothing over the specified range of frequencies. Combined with the temporal and spatial masking techniques mentioned above, this leads to significant improvements in binaural speech recognition.","PeriodicalId":290868,"journal":{"name":"2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"47 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126263783","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Automatic speech recognition of Arabic multi-genre broadcast media","authors":"M. Najafian, Wei-Ning Hsu, Ahmed Ali, James R. Glass","doi":"10.1109/ASRU.2017.8268957","DOIUrl":"https://doi.org/10.1109/ASRU.2017.8268957","url":null,"abstract":"This paper describes an Arabic Automatic Speech Recognition system developed on 15 hours of Multi-Genre Broadcast (MGB-3) data from YouTube, plus 1,200 hours of Multi-Dialect and Multi-Genre MGB-2 data recorded from the Aljazeera Arabic TV channel. In this paper, we report our investigations of a range of signal pre-processing, data augmentation, topic-specific language model adaptation, accent specific re-training, and deep learning based acoustic modeling topologies, such as feed-forward Deep Neural Networks (DNNs), Time-delay Neural Networks (TDNNs), Long Short-term Memory (LSTM) networks, Bidirectional LSTMs (BLSTMs), and a Bidirectional version of the Prioritized Grid LSTM (BPGLSTM) model. We propose a system combination for three purely sequence trained recognition systems based on lattice-free maximum mutual information, 4-gram language model re-scoring, and system combination using the minimum Bayes risk decoding criterion. The best word error rate we obtained on the MGB-3 Arabic development set using a 4-gram re-scoring strategy is 42.25% for a chain BLSTM system, compared to 65.44% baseline for a DNN system.","PeriodicalId":290868,"journal":{"name":"2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"100 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129470727","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An investigation of multi-speaker training for wavenet vocoder","authors":"Tomoki Hayashi, Akira Tamamori, Kazuhiro Kobayashi, K. Takeda, T. Toda","doi":"10.1109/ASRU.2017.8269007","DOIUrl":"https://doi.org/10.1109/ASRU.2017.8269007","url":null,"abstract":"In this paper, we investigate the effectiveness of multi-speaker training for WaveNet vocoder. In our previous work, we have demonstrated that our proposed speaker-dependent (SD) WaveNet vocoder, which is trained with a single speaker's speech data, is capable of modeling temporal waveform structure, such as phase information, and makes it possible to generate more naturally sounding synthetic voices compared to conventional high-quality vocoder, STRAIGHT. However, it is still difficult to generate synthetic voices of various speakers using the SD-WaveNet due to its speaker-dependent property. Towards the development of speaker-independent WaveNet vocoder, we apply multi-speaker training techniques to the WaveNet vocoder and investigate its effectiveness. The experimental results demonstrate that 1) the multispeaker WaveNet vocoder still outperforms STRAIGHT in generating known speakers' voices but it is comparable to STRAIGHT in generating unknown speakers' voices, and 2) the multi-speaker training is effective for developing the WaveNet vocoder capable of speech modification.","PeriodicalId":290868,"journal":{"name":"2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133884138","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DBLSTM based multilingual articulatory feature extraction for language documentation","authors":"Markus Müller, Sebastian Stüker, A. Waibel","doi":"10.1109/ASRU.2017.8268966","DOIUrl":"https://doi.org/10.1109/ASRU.2017.8268966","url":null,"abstract":"With more than 7,000 living languages in the world and many of them facing extinction, the need for language documentation is now more pressing than ever. This process is time-consuming, requiring linguists as each language features peculiarities that need to be addressed. While automating the whole process is difficult, we aim at providing methods to support linguists during documentation. One important step in the workflow is the discovery of the phonetic inventory. In the past, we proposed a first approach of first automatically segmenting recordings into phone-line units and second clustering these segments based on acoustic similarity, determined by articulatory features (AFs). We now propose a refined method using Deep Bi-directional LSTMs (DBLSTMs) over DNNs. Additionally, we use Language Feature Vectors (LFVs) which encode language specific peculiarities in a low dimensional representation. In contrast to adding LFVs to the acoustic input features, we modulated the output of the last hidden LSTM layer, forcing groups of LSTM cells to adapt to language related features. We evaluated our approach multilingually, using data from multiple languages. Results show an improvement in recognition accuracy across AF types: While LFVs improved the performance of DNNs, the gain is even bigger when using DBLSTMs.","PeriodicalId":290868,"journal":{"name":"2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123608171","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Neural relevance-aware query modeling for spoken document retrieval","authors":"Tien-Hong Lo, Ying-Wen Chen, Kuan-Yu Chen, H. Wang, Berlin Chen","doi":"10.1109/ASRU.2017.8268973","DOIUrl":"https://doi.org/10.1109/ASRU.2017.8268973","url":null,"abstract":"Spoken document retrieval (SDR) is becoming a much-needed application due to that unprecedented volumes of audio-visual media have been made available in our daily life. As far as we are aware, most of the wide variety of SDR methods mainly focus on exploring robust indexing and effective retrieval methods to quantify the relevance degree between a pair of query and document. However, similar to information retrieval (IR), a fundamental challenge facing SDR is that a query is usually too short to convey a user's information need, such that a retrieval system cannot always achieve prospective efficacy when with the existing retrieval methods. In order to further boost retrieval performance, several studies turn their attention to reformulating the original query by leveraging an online pseudo-relevance feedback (PRF) process, which often comes at the price of taking significant time. Motivated by these observations, this paper presents a novel extension of the general line of SDR research and its contribution is at least two-fold. First, building on neural network-based techniques, we put forward a neural relevance-aware query modeling (NRM) framework, which is designed to not only infer a discriminative query language model automatically for a given query, but also get around the time-consuming PRF process. Second, the utility of the methods instantiated from our proposed framework and several widely-used retrieval methods are extensively analyzed and compared on a standard SDR task, which suggests the superiority of our methods.","PeriodicalId":290868,"journal":{"name":"2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122184573","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Investigation of transfer learning for ASR using LF-MMI trained neural networks","authors":"Pegah Ghahremani, Vimal Manohar, Hossein Hadian, Daniel Povey, S. Khudanpur","doi":"10.1109/ASRU.2017.8268947","DOIUrl":"https://doi.org/10.1109/ASRU.2017.8268947","url":null,"abstract":"It is common in applications of ASR to have a large amount of data out-of-domain to the test data and a smaller amount of in-domain data similar to the test data. In this paper, we investigate different ways to utilize this out-of-domain data to improve ASR models based on Lattice-free MMI (LF-MMI). In particular, we experiment with multi-task training using a network with shared hidden layers; and we try various ways of adapting previously trained models to a new domain. Both types of methods are effective in reducing the WER versus in-domain models, with the jointly trained models generally giving more improvement.","PeriodicalId":290868,"journal":{"name":"2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129084341","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Investigating native and non-native English classification and transfer effects using Legendre polynomial coefficient clustering","authors":"Rachel Rakov, A. Rosenberg","doi":"10.1109/ASRU.2017.8268996","DOIUrl":"https://doi.org/10.1109/ASRU.2017.8268996","url":null,"abstract":"In this paper, we investigate similarities and differences in pitch contours among native English speakers and non-native English speakers (whose first language is Mandarin). In particular, we investigate if there are particular prosodic contours that are predictive of native and non-native English speech in the area of question intonation contours. We also look to see if we find evidence of negative transfer effects or second language learning effects around native Mandarin speakers who may be using Mandarin prosody when speaking English. To investigate these questions, we explore prosodic contour modeling techniques for native and non-native English speech by clustering Legendre polynomial coefficients. Our results show evidence of non-native English speakers using unexpected contours in the place of expected English prosody. We additionally find support that speakers in our corpus may be experiencing negative language transfer effects, as well as second language learning effects.","PeriodicalId":290868,"journal":{"name":"2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126487375","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The CMU entry to blizzard machine learning challenge","authors":"P. Baljekar, Sai Krishna Rallabandi, A. Black","doi":"10.1109/ASRU.2017.8268997","DOIUrl":"https://doi.org/10.1109/ASRU.2017.8268997","url":null,"abstract":"The paper describes Carnegie Mellon University's (CMU) entry to the ES-1 sub-task of the Blizzard Machine Learning Speech Synthesis Challenge 2017. The submitted system is a parametric model trained to predict vocoder parameters given linguistic features. The task in this year's challenge was to synthesize speech from children's audiobooks. Linguistic and acoustic features were provided by the organizers and the task was to find the best performing model. The paper explores various RNN architectures that were investigated and describes the final model that was submitted.","PeriodicalId":290868,"journal":{"name":"2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121188713","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}