{"title":"Reducing the computational complexity for whole word models","authors":"H. Soltau, H. Liao, H. Sak","doi":"10.1109/ASRU.2017.8268917","DOIUrl":"https://doi.org/10.1109/ASRU.2017.8268917","url":null,"abstract":"In a previous study, we demonstrated the feasibility to build a competitive, greatly simplified, large vocabulary continuous speech recognition system with whole words as acoustic units. In that system, we model about 100,000 words directly using deep bi-directional LSTM RNNs. To alleviate the data sparsity problem for word models, we train the model on 125,000 hours of semi-supervised acoustic training data. The resulting model works very well as an end-to-end all-neural speech recognition model without the use of any language model removing the need to decode. However, the very large output layer increases the computational cost substantially. In this work we address this issue by adding TDNN (Time Delay Neural Network) layers that reduce the frame rate to 120ms for the output layer. The TDNN layers are interspersed with the LSTM layers, gradually reducing the frame rate from 10ms to 120ms. The new model reduces the computational cost by 60% while improving the word error rate by 6% relative. Compared to a traditional LVCSR system, the whole word speech recognizer uses about the same CPU cycles and can easily be parallelized across CPU cores or run on GPUs.","PeriodicalId":290868,"journal":{"name":"2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125144702","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Perceptual quality and modeling accuracy of excitation parameters in DLSTM-based speech synthesis systems","authors":"Eunwoo Song, F. Soong, Hong-Goo Kang","doi":"10.1109/ASRU.2017.8269001","DOIUrl":"https://doi.org/10.1109/ASRU.2017.8269001","url":null,"abstract":"This paper investigates how the perceptual quality of the synthesized speech is affected by reconstruction errors in excitation signals generated by a deep learning-based statistical model. In this framework, the excitation signal obtained by an LPC inverse filter is first decomposed into harmonic and noise components using an improved time-frequency trajectory excitation (ITFTE) scheme, then they are trained and generated by a deep long short-term memory (DLSTM)-based speech synthesis system. By controlling the parametric dimension of the ITFTE vocoder, we analyze the impact of the harmonic and noise components to the perceptual quality of the synthesized speech. Both objective and subjective experimental results confirm that the maximum perceptually allowable spectral distortion for the harmonic spectrum of the generated excitation is ∼0.08 dB. On the other hand, the absolute spectral distortion in the noise components is meaningless, and only the spectral envelope is relevant to the perceptual quality.","PeriodicalId":290868,"journal":{"name":"2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"608 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123335177","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Binaural processing for robust recognition of degraded speech","authors":"Anjali Menon, Chanwoo Kim, Umpei Kurokawa, R. Stern","doi":"10.1109/ASRU.2017.8268912","DOIUrl":"https://doi.org/10.1109/ASRU.2017.8268912","url":null,"abstract":"This paper discusses a new combination of techniques that help in improving the accuracy of speech recognition in adverse conditions using two microphones. Classic approaches toward binaural speech processing use some form of cross-correlation over time across the two sensors to effectively isolate target speech from interferers. Several additional techniques using temporal and spatial masking have been proposed in the past to improve recognition accuracy in the presence of reverberation and interfering talkers. In this paper, we consider the use of cross-correlation across frequency over some limited range of frequency channels in addition to the existing methods of monaural and binaural processing. This has the effect of locating and reinforcing coincident peaks across frequency over the representation of binaural interaction and provides local smoothing over the specified range of frequencies. Combined with the temporal and spatial masking techniques mentioned above, this leads to significant improvements in binaural speech recognition.","PeriodicalId":290868,"journal":{"name":"2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"47 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126263783","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Automatic speech recognition of Arabic multi-genre broadcast media","authors":"M. Najafian, Wei-Ning Hsu, Ahmed Ali, James R. Glass","doi":"10.1109/ASRU.2017.8268957","DOIUrl":"https://doi.org/10.1109/ASRU.2017.8268957","url":null,"abstract":"This paper describes an Arabic Automatic Speech Recognition system developed on 15 hours of Multi-Genre Broadcast (MGB-3) data from YouTube, plus 1,200 hours of Multi-Dialect and Multi-Genre MGB-2 data recorded from the Aljazeera Arabic TV channel. In this paper, we report our investigations of a range of signal pre-processing, data augmentation, topic-specific language model adaptation, accent specific re-training, and deep learning based acoustic modeling topologies, such as feed-forward Deep Neural Networks (DNNs), Time-delay Neural Networks (TDNNs), Long Short-term Memory (LSTM) networks, Bidirectional LSTMs (BLSTMs), and a Bidirectional version of the Prioritized Grid LSTM (BPGLSTM) model. We propose a system combination for three purely sequence trained recognition systems based on lattice-free maximum mutual information, 4-gram language model re-scoring, and system combination using the minimum Bayes risk decoding criterion. The best word error rate we obtained on the MGB-3 Arabic development set using a 4-gram re-scoring strategy is 42.25% for a chain BLSTM system, compared to 65.44% baseline for a DNN system.","PeriodicalId":290868,"journal":{"name":"2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"100 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129470727","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An investigation of multi-speaker training for wavenet vocoder","authors":"Tomoki Hayashi, Akira Tamamori, Kazuhiro Kobayashi, K. Takeda, T. Toda","doi":"10.1109/ASRU.2017.8269007","DOIUrl":"https://doi.org/10.1109/ASRU.2017.8269007","url":null,"abstract":"In this paper, we investigate the effectiveness of multi-speaker training for WaveNet vocoder. In our previous work, we have demonstrated that our proposed speaker-dependent (SD) WaveNet vocoder, which is trained with a single speaker's speech data, is capable of modeling temporal waveform structure, such as phase information, and makes it possible to generate more naturally sounding synthetic voices compared to conventional high-quality vocoder, STRAIGHT. However, it is still difficult to generate synthetic voices of various speakers using the SD-WaveNet due to its speaker-dependent property. Towards the development of speaker-independent WaveNet vocoder, we apply multi-speaker training techniques to the WaveNet vocoder and investigate its effectiveness. The experimental results demonstrate that 1) the multispeaker WaveNet vocoder still outperforms STRAIGHT in generating known speakers' voices but it is comparable to STRAIGHT in generating unknown speakers' voices, and 2) the multi-speaker training is effective for developing the WaveNet vocoder capable of speech modification.","PeriodicalId":290868,"journal":{"name":"2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133884138","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DBLSTM based multilingual articulatory feature extraction for language documentation","authors":"Markus Müller, Sebastian Stüker, A. Waibel","doi":"10.1109/ASRU.2017.8268966","DOIUrl":"https://doi.org/10.1109/ASRU.2017.8268966","url":null,"abstract":"With more than 7,000 living languages in the world and many of them facing extinction, the need for language documentation is now more pressing than ever. This process is time-consuming, requiring linguists as each language features peculiarities that need to be addressed. While automating the whole process is difficult, we aim at providing methods to support linguists during documentation. One important step in the workflow is the discovery of the phonetic inventory. In the past, we proposed a first approach of first automatically segmenting recordings into phone-line units and second clustering these segments based on acoustic similarity, determined by articulatory features (AFs). We now propose a refined method using Deep Bi-directional LSTMs (DBLSTMs) over DNNs. Additionally, we use Language Feature Vectors (LFVs) which encode language specific peculiarities in a low dimensional representation. In contrast to adding LFVs to the acoustic input features, we modulated the output of the last hidden LSTM layer, forcing groups of LSTM cells to adapt to language related features. We evaluated our approach multilingually, using data from multiple languages. Results show an improvement in recognition accuracy across AF types: While LFVs improved the performance of DNNs, the gain is even bigger when using DBLSTMs.","PeriodicalId":290868,"journal":{"name":"2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123608171","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Neural relevance-aware query modeling for spoken document retrieval","authors":"Tien-Hong Lo, Ying-Wen Chen, Kuan-Yu Chen, H. Wang, Berlin Chen","doi":"10.1109/ASRU.2017.8268973","DOIUrl":"https://doi.org/10.1109/ASRU.2017.8268973","url":null,"abstract":"Spoken document retrieval (SDR) is becoming a much-needed application due to that unprecedented volumes of audio-visual media have been made available in our daily life. As far as we are aware, most of the wide variety of SDR methods mainly focus on exploring robust indexing and effective retrieval methods to quantify the relevance degree between a pair of query and document. However, similar to information retrieval (IR), a fundamental challenge facing SDR is that a query is usually too short to convey a user's information need, such that a retrieval system cannot always achieve prospective efficacy when with the existing retrieval methods. In order to further boost retrieval performance, several studies turn their attention to reformulating the original query by leveraging an online pseudo-relevance feedback (PRF) process, which often comes at the price of taking significant time. Motivated by these observations, this paper presents a novel extension of the general line of SDR research and its contribution is at least two-fold. First, building on neural network-based techniques, we put forward a neural relevance-aware query modeling (NRM) framework, which is designed to not only infer a discriminative query language model automatically for a given query, but also get around the time-consuming PRF process. Second, the utility of the methods instantiated from our proposed framework and several widely-used retrieval methods are extensively analyzed and compared on a standard SDR task, which suggests the superiority of our methods.","PeriodicalId":290868,"journal":{"name":"2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122184573","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Investigation of transfer learning for ASR using LF-MMI trained neural networks","authors":"Pegah Ghahremani, Vimal Manohar, Hossein Hadian, Daniel Povey, S. Khudanpur","doi":"10.1109/ASRU.2017.8268947","DOIUrl":"https://doi.org/10.1109/ASRU.2017.8268947","url":null,"abstract":"It is common in applications of ASR to have a large amount of data out-of-domain to the test data and a smaller amount of in-domain data similar to the test data. In this paper, we investigate different ways to utilize this out-of-domain data to improve ASR models based on Lattice-free MMI (LF-MMI). In particular, we experiment with multi-task training using a network with shared hidden layers; and we try various ways of adapting previously trained models to a new domain. Both types of methods are effective in reducing the WER versus in-domain models, with the jointly trained models generally giving more improvement.","PeriodicalId":290868,"journal":{"name":"2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129084341","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Investigating native and non-native English classification and transfer effects using Legendre polynomial coefficient clustering","authors":"Rachel Rakov, A. Rosenberg","doi":"10.1109/ASRU.2017.8268996","DOIUrl":"https://doi.org/10.1109/ASRU.2017.8268996","url":null,"abstract":"In this paper, we investigate similarities and differences in pitch contours among native English speakers and non-native English speakers (whose first language is Mandarin). In particular, we investigate if there are particular prosodic contours that are predictive of native and non-native English speech in the area of question intonation contours. We also look to see if we find evidence of negative transfer effects or second language learning effects around native Mandarin speakers who may be using Mandarin prosody when speaking English. To investigate these questions, we explore prosodic contour modeling techniques for native and non-native English speech by clustering Legendre polynomial coefficients. Our results show evidence of non-native English speakers using unexpected contours in the place of expected English prosody. We additionally find support that speakers in our corpus may be experiencing negative language transfer effects, as well as second language learning effects.","PeriodicalId":290868,"journal":{"name":"2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126487375","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The CMU entry to blizzard machine learning challenge","authors":"P. Baljekar, Sai Krishna Rallabandi, A. Black","doi":"10.1109/ASRU.2017.8268997","DOIUrl":"https://doi.org/10.1109/ASRU.2017.8268997","url":null,"abstract":"The paper describes Carnegie Mellon University's (CMU) entry to the ES-1 sub-task of the Blizzard Machine Learning Speech Synthesis Challenge 2017. The submitted system is a parametric model trained to predict vocoder parameters given linguistic features. The task in this year's challenge was to synthesize speech from children's audiobooks. Linguistic and acoustic features were provided by the organizers and the task was to find the best performing model. The paper explores various RNN architectures that were investigated and describes the final model that was submitted.","PeriodicalId":290868,"journal":{"name":"2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121188713","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}