Title: Power-Law Nonlinearity with Maximally Uniform Distribution Criterion for Improved Neural Network Training in Automatic Speech Recognition
Authors: Chanwoo Kim, Mehul Kumar, Kwangyoun Kim, Dhananjaya N. Gowda
Published in: 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), December 2019
DOI: 10.1109/ASRU46091.2019.9003973
Abstract: In this paper, we describe the Maximum Uniformity of Distribution (MUD) algorithm with a power-law nonlinearity. In this approach, we hypothesize that neural network training becomes more stable if the feature distribution is not heavily skewed. We propose two types of MUD: power function-based MUD and histogram-based MUD. In both approaches, we first obtain the mel filterbank coefficients and then apply a nonlinearity to each filterbank channel. With power function-based MUD, we apply a power-function nonlinearity whose coefficients are chosen to maximize the likelihood under the assumption that the nonlinearity outputs follow a uniform distribution. With histogram-based MUD, the empirical Cumulative Distribution Function (CDF) estimated from the training database is used to transform the original distribution into a uniform one. MUD processing does not rely on any prior knowledge (e.g., a logarithmic relation) about the energy of the incoming signal and the intensity perceived by a human listener. Experimental results with an end-to-end speech recognition system demonstrate that power function-based MUD outperforms conventional Mel-Frequency Cepstral Coefficients (MFCCs). On the LibriSpeech database, we achieve 4.02% WER on test-clean and 13.34% WER on test-other without using any Language Models (LMs). The major contribution of this work is a new, data-driven algorithm for designing the compressive nonlinearity, which is far more flexible than previous approaches and may be extended to other domains as well.

Title: Learning Hierarchical Representations for Expressive Speaking Style in End-to-End Speech Synthesis
Authors: Xiaochun An, Yuxuan Wang, Shan Yang, Zejun Ma, Lei Xie
Published in: 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), December 2019
DOI: 10.1109/ASRU46091.2019.9003859
Abstract: Although Global Style Tokens (GSTs) are a recently proposed method for uncovering expressive factors of variation in speaking style, they mix style attributes together without explicitly factorizing multiple levels of speaking style. In this work, we introduce a hierarchical GST architecture with residuals to Tacotron, which learns multi-level disentangled representations to model and control different style granularities in synthesized speech. We perform hierarchical evaluations conditioned on individual tokens from different GST layers. As the number of layers increases, we tend to observe a coarse-to-fine style decomposition: for example, the first GST layer learns a good representation of speaker identity, while finer speaking-style or emotion variations can be found in higher-level layers. The proposed model also shows good style-transfer performance.
{"title":"Character-Aware Attention-Based End-to-End Speech Recognition","authors":"Zhong Meng, Yashesh Gaur, Jinyu Li, Y. Gong","doi":"10.1109/ASRU46091.2019.9004018","DOIUrl":"https://doi.org/10.1109/ASRU46091.2019.9004018","url":null,"abstract":"Predicting words and subword units (WSUs) as the output has shown to be effective for the attention-based encoder-decoder (AED) model in end-to-end speech recognition. However, as one input to the decoder recurrent neural network (RNN), each WSU embedding is learned independently through context and acoustic information in a purely data-driven fashion. Little effort has been made to explicitly model the morphological relationships among WSUs. In this work, we propose a novel character-aware (CA) AED model in which each WSU embedding is computed by summarizing the embeddings of its constituent characters using a CA-RNN. This WSU-independent CA-RNN is jointly trained with the encoder, the decoder and the attention network of a conventional AED to predict WSUs. With CA-AED, the embeddings of morphologically similar WSUs are naturally and directly correlated through the CA-RNN in addition to the semantic and acoustic relations modeled by a traditional AED. Moreover, CA-AED significantly reduces the model parameters in a traditional AED by replacing the large pool of WSU embeddings with a much smaller set of character embeddings. On a 3400 hours Microsoft Cortana dataset, CA-AED achieves up to 11.9% relative WER improvement over a strong AED baseline with 27.1% fewer model parameters.","PeriodicalId":150913,"journal":{"name":"2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"319 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122703341","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Title: Semi-Supervised Training and Data Augmentation for Adaptation of Automatic Broadcast News Captioning Systems
Authors: Yinghui Huang, Samuel Thomas, Masayuki Suzuki, Zoltán Tüske, Larry Sansone, M. Picheny
Published in: 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), December 2019
DOI: 10.1109/ASRU46091.2019.9003943
Abstract: In this paper, we present a comprehensive study on building and adapting deep neural network based speech recognition systems for automatic closed captioning. We first build base automatic speech recognition (ASR) systems that are not specific to any particular show or station. These models are trained on nearly 6000 hours of broadcast news (BN) data using both conventional hybrid and more recent attention-based end-to-end acoustic models. We then employ various adaptation and data augmentation strategies to further improve the trained base models, using 535 hours of data from two independent BN sources to study how the base models can be customized. We observe up to 32% relative improvement with the proposed techniques on test sets related to, but independent of, the adaptation data. At these low word error rates (WERs), we believe the customized BN ASR systems can be used effectively for automatic closed captioning.

Title: Unsupervised Adaptation of Acoustic Models for ASR Using Utterance-Level Embeddings from Squeeze and Excitation Networks
Authors: Hardik B. Sailor, S. Deena, Md. Asif Jalal, R. Lileikyte, Thomas Hain
Published in: 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), December 2019
DOI: 10.1109/ASRU46091.2019.9003755
Abstract: This paper proposes the adaptation of neural network-based acoustic models using a Squeeze-and-Excitation (SE) network for automatic speech recognition (ASR). In particular, this work explores using the SE network to learn utterance-level embeddings. The acoustic modelling is performed with Light Gated Recurrent Units (LiGRU). The utterance embeddings are learned from hidden-unit activations jointly with the LiGRU and are used to scale the activations of the hidden layers in the LiGRU network. The advantage of such an approach is that it does not require domain labels, such as speaker or noise labels, to perform the adaptation, thereby providing unsupervised adaptation. Global average and attentive pooling are applied to the hidden units to extract utterance-level information that represents the speakers and acoustic conditions. ASR experiments were carried out on the TIMIT and Aurora 4 corpora. The proposed model achieves better performance than the respective baselines on both datasets, with relative improvements of 5.59% and 5.54% for TIMIT and Aurora 4, respectively. These experiments show the potential of using the conditioning information learned via utterance embeddings in the SE network to adapt acoustic models to speakers, noise, and other acoustic conditions.

Title: Attention Based On-Device Streaming Speech Recognition with Large Speech Corpus
Authors: Kwangyoun Kim, Kyungmin Lee, Dhananjaya N. Gowda, Junmo Park, Sungsoo Kim, Sichen Jin, Young-Yoon Lee, Jinsu Yeo, Daehyun Kim, Seokyeong Jung, Jungin Lee, Myoungji Han, Chanwoo Kim
Published in: 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), December 2019
DOI: 10.1109/ASRU46091.2019.9004027
Abstract: In this paper, we present a new on-device automatic speech recognition (ASR) system based on monotonic chunk-wise attention (MoChA) models trained with a large (> 10K hours) corpus. We attained a word recognition rate of around 90% for the general domain, mainly by using joint connectionist temporal classification (CTC) and cross-entropy (CE) losses, minimum word error rate (MWER) training, layer-wise pretraining, and data augmentation. In addition, we compressed our models to more than 3.4 times smaller using an iterative hyper low-rank approximation (LRA) method while minimizing the degradation in recognition accuracy. The memory footprint was further reduced with 8-bit quantization, bringing the final model size below 39 MB. For on-demand adaptation, we fused the MoChA models with statistical n-gram models and achieved a relative improvement of 36% on average in word error rate (WER) for target domains, including the general domain.
{"title":"Simple Gated Convnet for Small Footprint Acoustic Modeling","authors":"Lukas Lee, Jinhwan Park, Wonyong Sung","doi":"10.1109/ASRU46091.2019.9003993","DOIUrl":"https://doi.org/10.1109/ASRU46091.2019.9003993","url":null,"abstract":"Acoustic modeling with recurrent neural networks has shown very good performance, especially for end-to-end speech recognition. However, most recurrent neural networks require sequential computation of the output, which results in large memory access overhead when implemented in embedded devices. Convolution-based sequential modeling does not suffer from this problem; however, the model usually requires a large number of parameters. We propose simple gated convolutional neural networks (Simple Gated ConvNet) for acoustic modeling and show that the network performs very well even when the number of parameters is fairly small, less than 3 million. The Simple Gated ConvNet (SGCN) is constructed by combining the simplest form of Gated ConvNet and one-dimensional (1-D) depthwise convolution. The model has been evaluated using the Wall Street Journal (WSJ) Corpus and has shown a performance competitive to RNN-based ones. The performance of the SGCN has also been evaluated using the LibriSpeech Corpus. The developed model was implemented in ARM CPU based systems and showed the real time factor (RTF) of around 0.05.","PeriodicalId":150913,"journal":{"name":"2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"53 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115930436","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Title: Improving Speech-Based End-of-Turn Detection Via Cross-Modal Representation Learning with Punctuated Text Data
Authors: Ryo Masumura, Mana Ihori, Tomohiro Tanaka, Atsushi Ando, Ryo Ishii, T. Oba, Ryuichiro Higashinaka
Published in: 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), December 2019
DOI: 10.1109/ASRU46091.2019.9003816
Abstract: This paper presents a novel training method for speech-based end-of-turn detection that utilizes not only manually annotated speech data sets but also punctuated text data sets. Speech-based end-of-turn detection estimates from speech information whether a target speaker's utterance has ended. In previous studies, end-of-turn detection models were trained using only speech data sets containing manually annotated end-of-turn labels; since the amounts of annotated speech data are often limited, such models were unable to correctly handle a wide variety of speech patterns. To mitigate this data scarcity problem, our key idea is to leverage punctuated text data sets for building more effective speech-based end-of-turn detection. The proposed method therefore introduces cross-modal representation learning to construct a speech encoder and a text encoder that map speech and text carrying the same lexical information into similar vector representations. This enables us to train speech-based end-of-turn detection models from punctuated text data sets by tackling text-based sentence boundary detection. In experiments on contact center calls, we show that speech-based end-of-turn detection models using hierarchical recurrent neural networks can be improved through the use of punctuated text data sets.
{"title":"Speaker-Aware Speech-Transformer","authors":"Zhiyun Fan, Jie Li, Shiyu Zhou, Bo Xu","doi":"10.1109/ASRU46091.2019.9003844","DOIUrl":"https://doi.org/10.1109/ASRU46091.2019.9003844","url":null,"abstract":"Recently, end-to-end (E2E) models become a competitive alternative to the conventional hybrid automatic speech recognition (ASR) systems. However, they still suffer from speaker mismatch in training and testing condition. In this paper, we use Speech-Transformer (ST) as the study platform to investigate speaker aware training of E2E models. We propose a model called Speaker-Aware Speech-Transformer (SAST), which is a standard ST equipped with a speaker attention module (SAM). The SAM has a static speaker knowledge block (SKB) that is made of i-vectors. At each time step, the encoder output attends to the i-vectors in the block, and generates a weighted combined speaker embedding vector, which helps the model to normalize the speaker variations. The SAST model trained in this way becomes independent of specific training speakers and thus generalizes better to unseen testing speakers. We investigate different factors of SAM. Experimental results on the AISHELL-1 task show that SAST achieves a relative 6.5% CER reduction (CERR) over the speaker-independent (SI) baseline. Moreover, we demonstrate that SAST still works quite well even if the i-vectors in SKB all come from a different data source other than the acoustic training set.","PeriodicalId":150913,"journal":{"name":"2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129362252","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improving Grapheme-to-Phoneme Conversion by Investigating Copying Mechanism in Recurrent Architectures","authors":"Abhishek Niranjan, M. Shaik","doi":"10.1109/ASRU46091.2019.9003729","DOIUrl":"https://doi.org/10.1109/ASRU46091.2019.9003729","url":null,"abstract":"Attention driven encoder-decoder architectures have become highly successful in various sequence-to-sequence learning tasks. We propose copy-augmented Bi-directional Long Short-Term Memory based Encoder-Decoder architecture for the Grapheme-to-Phoneme conversion. In Grapheme-to-Phoneme task, a number of character units in words possess high degree of similarity with some phoneme unit(s). Thus, we make an attempt to capture this characteristic using copy-augmented architecture. Our proposed model automatically learns to generate phoneme sequences during inference by copying source token embeddings to the decoder's output in a controlled manner. To our knowledge, this is the first time the copy-augmentation is being investigated for Grapheme-to-Phoneme conversion task. We validate our experiments over accented and non-accented publicly available CMU-Dict datasets and achieve State-of-The-Art performances in terms of both phoneme and word error rates. Further, we verify the applicability of our proposed approach on Hindi Lexicon and show that our model outperforms all recent State-of-The-Art results.","PeriodicalId":150913,"journal":{"name":"2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130168869","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}