2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU): Latest Publications

Reducing the computational complexity for whole word models
Pub Date: 2017-12-01 | DOI: 10.1109/ASRU.2017.8268917
H. Soltau, H. Liao, H. Sak
Abstract: In a previous study, we demonstrated the feasibility of building a competitive, greatly simplified, large-vocabulary continuous speech recognition system with whole words as acoustic units. In that system, we model about 100,000 words directly using deep bi-directional LSTM RNNs. To alleviate the data-sparsity problem for word models, we train the model on 125,000 hours of semi-supervised acoustic training data. The resulting model works very well as an end-to-end, all-neural speech recognition model without any language model, removing the need for decoding. However, the very large output layer increases the computational cost substantially. In this work we address this issue by adding TDNN (Time-Delay Neural Network) layers that reduce the frame rate at the output layer to 120 ms. The TDNN layers are interspersed with the LSTM layers, gradually reducing the frame rate from 10 ms to 120 ms. The new model reduces the computational cost by 60% while improving the word error rate by 6% relative. Compared to a traditional LVCSR system, the whole-word speech recognizer uses about the same CPU cycles and can easily be parallelized across CPU cores or run on GPUs.
Citations: 13
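As an illustration of the frame-rate reduction described above, the sketch below interleaves strided 1-D convolutions (TDNN layers) with bi-directional LSTM layers so the output softmax runs at 1/12 of the input frame rate (10 ms to 120 ms, via strides 2, 2, 3). This is a minimal PyTorch approximation under assumed layer sizes, not the authors' implementation.

```python
# A minimal sketch (assumptions, not the paper's code) of interleaving
# strided TDNN layers with BLSTMs to subsample 10 ms frames down to 120 ms.
import torch
import torch.nn as nn

class TDNNLSTMStack(nn.Module):
    def __init__(self, feat_dim=80, hidden=512, vocab=100_000):
        super().__init__()
        self.blocks = nn.ModuleList()
        in_dim = feat_dim
        for stride in (2, 2, 3):  # cumulative subsampling of 2*2*3 = 12x
            tdnn = nn.Conv1d(in_dim, hidden, kernel_size=3, stride=stride, padding=1)
            lstm = nn.LSTM(hidden, hidden, batch_first=True, bidirectional=True)
            self.blocks.append(nn.ModuleDict({"tdnn": tdnn, "lstm": lstm}))
            in_dim = 2 * hidden  # BLSTM doubles the feature dimension
        self.out = nn.Linear(in_dim, vocab)  # large whole-word output layer

    def forward(self, x):  # x: (batch, frames at 10 ms, feat_dim)
        for blk in self.blocks:
            x = blk["tdnn"](x.transpose(1, 2)).transpose(1, 2)  # subsample in time
            x, _ = blk["lstm"](x)
        return self.out(x)  # (batch, frames at 120 ms, vocab)
```

Because the expensive 100,000-way output layer is evaluated 12x less often, most of the claimed cost reduction falls out of the subsampling alone.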
Perceptual quality and modeling accuracy of excitation parameters in DLSTM-based speech synthesis systems
Pub Date: 2017-12-01 | DOI: 10.1109/ASRU.2017.8269001
Eunwoo Song, F. Soong, Hong-Goo Kang
Abstract: This paper investigates how the perceptual quality of synthesized speech is affected by reconstruction errors in excitation signals generated by a deep-learning-based statistical model. In this framework, the excitation signal obtained by an LPC inverse filter is first decomposed into harmonic and noise components using an improved time-frequency trajectory excitation (ITFTE) scheme; these components are then trained and generated by a deep long short-term memory (DLSTM)-based speech synthesis system. By controlling the parametric dimension of the ITFTE vocoder, we analyze the impact of the harmonic and noise components on the perceptual quality of the synthesized speech. Both objective and subjective experimental results confirm that the maximum perceptually allowable spectral distortion for the harmonic spectrum of the generated excitation is ∼0.08 dB. For the noise components, on the other hand, the absolute spectral distortion is irrelevant; only the spectral envelope matters to perceptual quality.
Citations: 1
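The 0.08 dB threshold above is stated in terms of spectral distortion between reference and generated harmonic spectra. The sketch below computes a standard log-spectral distortion in dB; the paper's exact definition may differ, so treat this as an assumed, generic form of the measure.

```python
# A hedged sketch of log-spectral distortion (dB) between two harmonic
# amplitude spectra; the exact metric in the paper may be defined differently.
import numpy as np

def log_spectral_distortion_db(ref_amp, gen_amp, eps=1e-12):
    """RMS difference of log-magnitude spectra, in dB."""
    ref_db = 20.0 * np.log10(np.abs(ref_amp) + eps)
    gen_db = 20.0 * np.log10(np.abs(gen_amp) + eps)
    return float(np.sqrt(np.mean((ref_db - gen_db) ** 2)))
```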
Binaural processing for robust recognition of degraded speech
Pub Date: 2017-12-01 | DOI: 10.1109/ASRU.2017.8268912
Anjali Menon, Chanwoo Kim, Umpei Kurokawa, R. Stern
Abstract: This paper discusses a new combination of techniques that improve the accuracy of speech recognition in adverse conditions using two microphones. Classic approaches to binaural speech processing use some form of cross-correlation over time across the two sensors to effectively isolate target speech from interferers. Several additional techniques using temporal and spatial masking have been proposed in the past to improve recognition accuracy in the presence of reverberation and interfering talkers. In this paper, we consider the use of cross-correlation across frequency, over a limited range of frequency channels, in addition to the existing methods of monaural and binaural processing. This locates and reinforces coincident peaks across frequency in the representation of binaural interaction and provides local smoothing over the specified range of frequencies. Combined with the temporal and spatial masking techniques mentioned above, this leads to significant improvements in binaural speech recognition.
Citations: 0
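To make the two-stage idea concrete, here is a minimal illustration (an assumption-laden sketch, not the authors' implementation): first, a normalized interaural cross-correlation per frequency band; second, averaging each band's correlation function with its neighbors to reinforce coincident peaks across frequency.

```python
# Illustrative sketch: per-band interaural cross-correlation followed by
# smoothing across a limited range of neighboring frequency channels.
import numpy as np

def interaural_xcorr(left_bands, right_bands, max_lag=16):
    """left_bands, right_bands: (n_bands, n_samples) filterbank outputs."""
    n_bands, n = left_bands.shape
    lags = np.arange(-max_lag, max_lag + 1)
    cc = np.zeros((n_bands, lags.size))
    for b in range(n_bands):
        l, r = left_bands[b], right_bands[b]
        denom = np.sqrt(np.dot(l, l) * np.dot(r, r)) + 1e-12
        for i, d in enumerate(lags):
            if d >= 0:
                cc[b, i] = np.dot(l[d:], r[:n - d]) / denom
            else:
                cc[b, i] = np.dot(l[:n + d], r[-d:]) / denom
    return lags, cc

def smooth_across_frequency(cc, span=2):
    """Average each band's correlation with its +/- span neighbors."""
    out = np.empty_like(cc)
    for b in range(cc.shape[0]):
        lo, hi = max(0, b - span), min(cc.shape[0], b + span + 1)
        out[b] = cc[lo:hi].mean(axis=0)
    return out
```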
Automatic speech recognition of Arabic multi-genre broadcast media
Pub Date: 2017-12-01 | DOI: 10.1109/ASRU.2017.8268957
M. Najafian, Wei-Ning Hsu, Ahmed Ali, James R. Glass
Abstract: This paper describes an Arabic automatic speech recognition system developed on 15 hours of Multi-Genre Broadcast (MGB-3) data from YouTube, plus 1,200 hours of multi-dialect, multi-genre MGB-2 data recorded from the Aljazeera Arabic TV channel. We report our investigations of a range of signal pre-processing, data augmentation, topic-specific language model adaptation, accent-specific re-training, and deep-learning-based acoustic modeling topologies, such as feed-forward Deep Neural Networks (DNNs), Time-Delay Neural Networks (TDNNs), Long Short-Term Memory (LSTM) networks, Bidirectional LSTMs (BLSTMs), and a bidirectional version of the Prioritized Grid LSTM (BPGLSTM). We propose a combination of three purely sequence-trained recognition systems based on lattice-free maximum mutual information, 4-gram language model re-scoring, and system combination under the minimum Bayes risk decoding criterion. The best word error rate we obtained on the MGB-3 Arabic development set using a 4-gram re-scoring strategy is 42.25% for a chain BLSTM system, compared to a 65.44% baseline for a DNN system.
Citations: 19
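The full pipeline above involves lattice machinery, but the 4-gram re-scoring step can be illustrated on an n-best list: each hypothesis's acoustic score is combined log-linearly with an LM score and the list is re-ranked. The interface and weights below are assumptions for illustration, not the paper's setup.

```python
# A hedged sketch of n-best rescoring with a higher-order LM score.
def rescore_nbest(nbest, lm_logprob, lm_weight=0.7):
    """nbest: list of (word_list, acoustic_logprob);
    lm_logprob: callable returning a log-probability for a word list."""
    rescored = [(w, am + lm_weight * lm_logprob(w)) for w, am in nbest]
    return sorted(rescored, key=lambda x: x[1], reverse=True)

# Toy usage with a stub LM that simply penalizes hypothesis length:
hyps = [(["the", "news"], -12.3), (["then", "ews"], -11.9)]
best_words, best_score = rescore_nbest(hyps, lm_logprob=lambda w: -1.5 * len(w))[0]
```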
An investigation of multi-speaker training for WaveNet vocoder
Pub Date: 2017-12-01 | DOI: 10.1109/ASRU.2017.8269007
Tomoki Hayashi, Akira Tamamori, Kazuhiro Kobayashi, K. Takeda, T. Toda
Abstract: In this paper, we investigate the effectiveness of multi-speaker training for the WaveNet vocoder. In our previous work, we demonstrated that our proposed speaker-dependent (SD) WaveNet vocoder, trained on a single speaker's speech data, is capable of modeling temporal waveform structure, such as phase information, and generates more natural-sounding synthetic voices than a conventional high-quality vocoder, STRAIGHT. However, it is still difficult to generate synthetic voices of various speakers with the SD-WaveNet due to its speaker-dependent nature. Toward the development of a speaker-independent WaveNet vocoder, we apply multi-speaker training techniques to the WaveNet vocoder and investigate their effectiveness. The experimental results demonstrate that 1) the multi-speaker WaveNet vocoder still outperforms STRAIGHT in generating known speakers' voices but is comparable to STRAIGHT for unknown speakers' voices, and 2) multi-speaker training is effective for developing a WaveNet vocoder capable of speech modification.
Citations: 99
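Multi-speaker WaveNet training is usually realized by conditioning each gated dilated convolution on a global speaker embedding. The sketch below shows that mechanism in simplified PyTorch form; the layer shapes and the additive conditioning are assumptions, not the paper's exact architecture.

```python
# A simplified sketch of speaker conditioning in a WaveNet-style gated block.
import torch
import torch.nn as nn

class GatedResidualBlock(nn.Module):
    def __init__(self, channels=64, dilation=2, n_speakers=10, emb_dim=16):
        super().__init__()
        self.conv = nn.Conv1d(channels, 2 * channels, kernel_size=2,
                              dilation=dilation, padding=dilation)
        self.spk = nn.Embedding(n_speakers, emb_dim)
        self.spk_proj = nn.Linear(emb_dim, 2 * channels)
        self.res = nn.Conv1d(channels, channels, kernel_size=1)

    def forward(self, x, speaker_id):          # x: (batch, channels, time)
        h = self.conv(x)[..., :x.size(-1)]     # trim padding so lengths match
        s = self.spk_proj(self.spk(speaker_id)).unsqueeze(-1)  # (B, 2C, 1)
        h = h + s                               # broadcast speaker bias over time
        f, g = h.chunk(2, dim=1)                # filter / gate split
        out = torch.tanh(f) * torch.sigmoid(g)
        return x + self.res(out)                # residual connection
```

At generation time, swapping `speaker_id` selects the target voice; the paper's finding is that this works well for speakers seen in training but degrades to STRAIGHT-level quality for unseen ones.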
DBLSTM based multilingual articulatory feature extraction for language documentation
Pub Date: 2017-12-01 | DOI: 10.1109/ASRU.2017.8268966
Markus Müller, Sebastian Stüker, A. Waibel
Abstract: With more than 7,000 living languages in the world, many of them facing extinction, the need for language documentation is more pressing than ever. This process is time-consuming and requires linguists, as each language has peculiarities that need to be addressed. While automating the whole process is difficult, we aim to provide methods that support linguists during documentation. One important step in the workflow is the discovery of the phonetic inventory. In the past, we proposed an approach that first automatically segments recordings into phone-like units and then clusters these segments based on acoustic similarity, determined by articulatory features (AFs). We now propose a refined method using Deep Bi-directional LSTMs (DBLSTMs) instead of DNNs. Additionally, we use Language Feature Vectors (LFVs), which encode language-specific peculiarities in a low-dimensional representation. In contrast to adding LFVs to the acoustic input features, we modulate the output of the last hidden LSTM layer, forcing groups of LSTM cells to adapt to language-related features. We evaluated our approach multilingually, using data from multiple languages. Results show an improvement in recognition accuracy across AF types: while LFVs improved the performance of DNNs, the gain is even larger with DBLSTMs.
Citations: 2
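A rough sketch of the modulation idea: project the LFV and multiply it element-wise into the output of the last hidden BLSTM layer before the AF classifier. Dimensions and the sigmoid gating form below are assumptions; the paper may use a different modulation function.

```python
# Hedged sketch: LFV-modulated BLSTM for articulatory-feature classification.
import torch
import torch.nn as nn

class LFVModulatedBLSTM(nn.Module):
    def __init__(self, feat_dim=40, hidden=256, lfv_dim=32, n_af_classes=30):
        super().__init__()
        self.blstm = nn.LSTM(feat_dim, hidden, num_layers=2,
                             batch_first=True, bidirectional=True)
        self.lfv_gate = nn.Linear(lfv_dim, 2 * hidden)
        self.out = nn.Linear(2 * hidden, n_af_classes)

    def forward(self, x, lfv):            # x: (B, T, feat), lfv: (B, lfv_dim)
        h, _ = self.blstm(x)              # (B, T, 2*hidden)
        gate = torch.sigmoid(self.lfv_gate(lfv)).unsqueeze(1)  # (B, 1, 2*hidden)
        return self.out(h * gate)         # language-modulated AF logits
```

Multiplicative gating of this kind lets groups of LSTM cells specialize per language while the rest of the network stays shared, matching the motivation stated in the abstract.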
Neural relevance-aware query modeling for spoken document retrieval
Pub Date: 2017-12-01 | DOI: 10.1109/ASRU.2017.8268973
Tien-Hong Lo, Ying-Wen Chen, Kuan-Yu Chen, H. Wang, Berlin Chen
Abstract: Spoken document retrieval (SDR) is becoming a much-needed application, as unprecedented volumes of audio-visual media have become available in our daily lives. As far as we are aware, most SDR methods focus on robust indexing and effective retrieval methods to quantify the degree of relevance between a query and a document. However, as in information retrieval (IR), a fundamental challenge facing SDR is that a query is usually too short to convey the user's information need, so a retrieval system cannot always achieve the expected efficacy with existing retrieval methods. To further boost retrieval performance, several studies have turned to reformulating the original query via an online pseudo-relevance feedback (PRF) process, which often comes at the price of significant processing time. Motivated by these observations, this paper presents a novel extension of the general line of SDR research, and its contribution is at least two-fold. First, building on neural network-based techniques, we put forward a neural relevance-aware query modeling (NRM) framework, designed not only to infer a discriminative query language model automatically for a given query, but also to avoid the time-consuming PRF process. Second, the utility of the methods instantiated from our proposed framework and several widely used retrieval methods are extensively analyzed and compared on a standard SDR task, which suggests the superiority of our methods.
Citations: 4
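To illustrate the general idea of inferring an expanded query language model without an online PRF pass, the sketch below derives P(w | query) from pre-trained word-embedding similarity. This is an illustrative stand-in, not the paper's NRM model, and the embedding lookup and temperature are assumptions.

```python
# Illustrative sketch: an embedding-based query language model that avoids
# the online pseudo-relevance-feedback pass over retrieved documents.
import numpy as np

def neural_query_model(query_terms, vocab, emb, temperature=0.1):
    """Return P(w | query) over `vocab` from embedding similarity.

    emb: dict word -> unit-norm vector (an assumed pre-trained lookup)."""
    q = np.mean([emb[t] for t in query_terms], axis=0)
    q /= np.linalg.norm(q) + 1e-12
    scores = np.array([q @ emb[w] for w in vocab]) / temperature
    scores -= scores.max()                 # numerical stability for softmax
    p = np.exp(scores)
    return dict(zip(vocab, p / p.sum()))
```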
Investigation of transfer learning for ASR using LF-MMI trained neural networks
Pub Date: 2017-12-01 | DOI: 10.1109/ASRU.2017.8268947
Pegah Ghahremani, Vimal Manohar, Hossein Hadian, Daniel Povey, S. Khudanpur
Abstract: It is common in ASR applications to have a large amount of data that is out-of-domain with respect to the test data and a smaller amount of in-domain data similar to the test data. In this paper, we investigate different ways to utilize this out-of-domain data to improve ASR models based on lattice-free MMI (LF-MMI). In particular, we experiment with multi-task training using a network with shared hidden layers, and we try various ways of adapting previously trained models to a new domain. Both types of methods are effective in reducing the WER relative to in-domain models, with the jointly trained models generally giving more improvement.
Citations: 75
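The multi-task setup above amounts to shared hidden layers feeding per-domain output heads, with each minibatch backpropagating through the head matching its domain. A minimal sketch follows; since LF-MMI itself requires lattice machinery, a generic loss would stand in here, and all layer sizes are assumptions.

```python
# A minimal sketch of multi-task training with shared hidden layers and
# domain-specific output heads (e.g., head 0: out-of-domain, head 1: in-domain).
import torch
import torch.nn as nn

class SharedHiddenMultiTask(nn.Module):
    def __init__(self, feat_dim=40, hidden=512, out_dims=(3000, 4000)):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.heads = nn.ModuleList(nn.Linear(hidden, d) for d in out_dims)

    def forward(self, x, domain):          # domain: 0 or 1
        return self.heads[domain](self.shared(x))
```

For the adaptation alternative, the same shared stack would instead be initialized from the out-of-domain model and fine-tuned on in-domain data.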
Investigating native and non-native English classification and transfer effects using Legendre polynomial coefficient clustering
Pub Date: 2017-12-01 | DOI: 10.1109/ASRU.2017.8268996
Rachel Rakov, A. Rosenberg
Abstract: In this paper, we investigate similarities and differences in pitch contours between native English speakers and non-native English speakers whose first language is Mandarin. In particular, we investigate whether particular prosodic contours are predictive of native versus non-native English speech in the area of question intonation. We also look for evidence of negative transfer effects, or second-language learning effects, among native Mandarin speakers who may be using Mandarin prosody when speaking English. To investigate these questions, we explore prosodic contour modeling techniques for native and non-native English speech by clustering Legendre polynomial coefficients. Our results show evidence of non-native English speakers using unexpected contours in place of expected English prosody. We additionally find support that speakers in our corpus may be experiencing negative language transfer effects, as well as second-language learning effects.
Citations: 0
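The modeling pipeline named in the title can be sketched in a few lines: fit a low-order Legendre polynomial to each f0 contour, then cluster the coefficient vectors. The polynomial degree, normalization, and cluster count below are assumptions, not the paper's settings.

```python
# Plausible sketch: Legendre-coefficient representation of f0 contours,
# followed by k-means clustering of the coefficient vectors.
import numpy as np
from numpy.polynomial import legendre
from sklearn.cluster import KMeans

def legendre_coeffs(f0, deg=4):
    """Fit a Legendre polynomial to one f0 contour (1-D array, unvoiced
    frames already removed), mapped onto the canonical domain [-1, 1]."""
    x = np.linspace(-1.0, 1.0, len(f0))
    return legendre.legfit(x, f0, deg)     # (deg + 1,) coefficients

def cluster_contours(contours, deg=4, k=8, seed=0):
    X = np.vstack([legendre_coeffs(c, deg) for c in contours])
    return KMeans(n_clusters=k, random_state=seed, n_init=10).fit_predict(X)
```

A convenient property of this representation is that contours of different lengths map to fixed-size coefficient vectors, so rises, falls, and plateaus can be compared directly.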
The CMU entry to the Blizzard Machine Learning Challenge
Pub Date: 2017-12-01 | DOI: 10.1109/ASRU.2017.8268997
P. Baljekar, Sai Krishna Rallabandi, A. Black
Abstract: This paper describes Carnegie Mellon University's (CMU) entry to the ES-1 sub-task of the Blizzard Machine Learning Speech Synthesis Challenge 2017. The submitted system is a parametric model trained to predict vocoder parameters from linguistic features. The task in this year's challenge was to synthesize speech from children's audiobooks. Linguistic and acoustic features were provided by the organizers, and the task was to find the best-performing model. The paper explores the various RNN architectures that were investigated and describes the final model that was submitted.
Citations: 1
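The general shape of such a system is a recurrent network regressing frame-level vocoder parameters from linguistic features, trained with a regression loss. The sketch below shows one such shape; the GRU choice, layer sizes, and output dimensionality are illustrative assumptions, not the submitted model.

```python
# A generic sketch of a linguistic-features-to-vocoder-parameters regressor.
import torch
import torch.nn as nn

class LinguisticToVocoder(nn.Module):
    def __init__(self, ling_dim=300, hidden=256, vocoder_dim=67):
        super().__init__()
        self.rnn = nn.GRU(ling_dim, hidden, num_layers=2,
                          batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, vocoder_dim)

    def forward(self, ling):               # ling: (B, T, ling_dim)
        h, _ = self.rnn(ling)
        return self.proj(h)                # e.g., spectral + excitation params

loss_fn = nn.MSELoss()                     # trained as frame-level regression
```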