Title: Power-Law Nonlinearity with Maximally Uniform Distribution Criterion for Improved Neural Network Training in Automatic Speech Recognition
Authors: Chanwoo Kim, Mehul Kumar, Kwangyoun Kim, Dhananjaya N. Gowda
Published in: 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), December 2019
DOI: 10.1109/ASRU46091.2019.9003973
Abstract: In this paper, we describe the Maximum Uniformity of Distribution (MUD) algorithm with a power-law nonlinearity. In this approach, we hypothesize that neural network training becomes more stable if the feature distribution is not heavily skewed. We propose two types of MUD: power function-based MUD and histogram-based MUD. In both approaches, we first obtain the mel filterbank coefficients and then apply a nonlinearity to each filterbank channel. With power function-based MUD, we apply a power-function nonlinearity whose coefficients are chosen to maximize the likelihood under the assumption that the nonlinearity outputs follow a uniform distribution. With histogram-based MUD, the empirical Cumulative Distribution Function (CDF) estimated from the training database is used to transform the original distribution into a uniform one. MUD processing does not rely on any prior knowledge (e.g., a logarithmic relation) about the energy of the incoming signal and the intensity perceived by a human listener. Experimental results with an end-to-end speech recognition system demonstrate that power function-based MUD outperforms conventional Mel-Frequency Cepstral Coefficients (MFCCs). On the LibriSpeech database, we achieve 4.02% WER on test-clean and 13.34% WER on test-other without using any Language Models (LMs). The major contribution of this work is a new, data-driven algorithm for designing the compressive nonlinearity, which is far more flexible than previous approaches and may be extended to other domains as well.

Title: Learning Hierarchical Representations for Expressive Speaking Style in End-to-End Speech Synthesis
Authors: Xiaochun An, Yuxuan Wang, Shan Yang, Zejun Ma, Lei Xie
Published in: 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), December 2019
DOI: 10.1109/ASRU46091.2019.9003859
Abstract: Although Global Style Tokens (GSTs) are a recently proposed method for uncovering expressive factors of variation in speaking style, they mix style attributes together without explicitly factorizing multiple levels of speaking style. In this work, we introduce a hierarchical GST architecture with residuals to Tacotron, which learns multi-level disentangled representations to model and control different style granularities in synthesized speech. We perform hierarchical evaluations conditioned on individual tokens from different GST layers. As the number of layers increases, we tend to observe a coarse-to-fine style decomposition: for example, the first GST layer learns a good representation of speaker identity, while finer speaking-style or emotion variations can be found in higher-level layers. The proposed model also shows good style-transfer performance.
{"title":"Character-Aware Attention-Based End-to-End Speech Recognition","authors":"Zhong Meng, Yashesh Gaur, Jinyu Li, Y. Gong","doi":"10.1109/ASRU46091.2019.9004018","DOIUrl":"https://doi.org/10.1109/ASRU46091.2019.9004018","url":null,"abstract":"Predicting words and subword units (WSUs) as the output has shown to be effective for the attention-based encoder-decoder (AED) model in end-to-end speech recognition. However, as one input to the decoder recurrent neural network (RNN), each WSU embedding is learned independently through context and acoustic information in a purely data-driven fashion. Little effort has been made to explicitly model the morphological relationships among WSUs. In this work, we propose a novel character-aware (CA) AED model in which each WSU embedding is computed by summarizing the embeddings of its constituent characters using a CA-RNN. This WSU-independent CA-RNN is jointly trained with the encoder, the decoder and the attention network of a conventional AED to predict WSUs. With CA-AED, the embeddings of morphologically similar WSUs are naturally and directly correlated through the CA-RNN in addition to the semantic and acoustic relations modeled by a traditional AED. Moreover, CA-AED significantly reduces the model parameters in a traditional AED by replacing the large pool of WSU embeddings with a much smaller set of character embeddings. On a 3400 hours Microsoft Cortana dataset, CA-AED achieves up to 11.9% relative WER improvement over a strong AED baseline with 27.1% fewer model parameters.","PeriodicalId":150913,"journal":{"name":"2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"319 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122703341","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Title: Semi-Supervised Training and Data Augmentation for Adaptation of Automatic Broadcast News Captioning Systems
Authors: Yinghui Huang, Samuel Thomas, Masayuki Suzuki, Zoltán Tüske, Larry Sansone, M. Picheny
Published in: 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), December 2019
DOI: 10.1109/ASRU46091.2019.9003943
Abstract: In this paper, we present a comprehensive study on building and adapting deep neural network based speech recognition systems for automatic closed captioning. We first build base automatic speech recognition (ASR) systems that are not specific to any particular show or station. These models are trained on nearly 6000 hours of broadcast news (BN) data using both conventional hybrid and more recent attention-based end-to-end acoustic models. We then employ various adaptation and data augmentation strategies to further improve the trained base models, using 535 hours of data from two independent BN sources to study how the base models can be customized. We observe up to 32% relative improvement with the proposed techniques on test sets related to, but independent of, the adaptation data. At these low word error rates (WERs), we believe the customized BN ASR systems can be used effectively for automatic closed captioning.

Title: Unsupervised Adaptation of Acoustic Models for ASR Using Utterance-Level Embeddings from Squeeze and Excitation Networks
Authors: Hardik B. Sailor, S. Deena, Md. Asif Jalal, R. Lileikyte, Thomas Hain
Published in: 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), December 2019
DOI: 10.1109/ASRU46091.2019.9003755
Abstract: This paper proposes the adaptation of neural network-based acoustic models using a Squeeze-and-Excitation (SE) network for automatic speech recognition (ASR). In particular, this work explores using the SE network to learn utterance-level embeddings. The acoustic modelling is performed with Light Gated Recurrent Units (LiGRU). The utterance embeddings are learned from hidden-unit activations jointly with the LiGRU and are used to scale the activations of the hidden layers in the LiGRU network. The advantage of such an approach is that it does not require domain labels, such as speaker or noise labels, to perform the adaptation, thereby providing unsupervised adaptation. Global average and attentive pooling are applied to the hidden units to extract utterance-level information that represents the speakers and acoustic conditions. ASR experiments were carried out on the TIMIT and Aurora 4 corpora. The proposed model achieves better performance than the respective baselines on both datasets, with relative improvements of 5.59% and 5.54% for TIMIT and Aurora 4, respectively. These experiments show the potential of using the conditioning information learned via utterance embeddings in the SE network to adapt acoustic models to speakers, noise, and other acoustic conditions.

Title: Attention Based On-Device Streaming Speech Recognition with Large Speech Corpus
Authors: Kwangyoun Kim, Kyungmin Lee, Dhananjaya N. Gowda, Junmo Park, Sungsoo Kim, Sichen Jin, Young-Yoon Lee, Jinsu Yeo, Daehyun Kim, Seokyeong Jung, Jungin Lee, Myoungji Han, Chanwoo Kim
Published in: 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), December 2019
DOI: 10.1109/ASRU46091.2019.9004027
Abstract: In this paper, we present a new on-device automatic speech recognition (ASR) system based on monotonic chunk-wise attention (MoChA) models trained with a large (> 10K hours) corpus. We attained a word recognition rate of around 90% for the general domain, mainly by using joint connectionist temporal classification (CTC) and cross-entropy (CE) losses, minimum word error rate (MWER) training, layer-wise pretraining, and data augmentation. In addition, we compressed our models to more than 3.4 times smaller using an iterative hyper low-rank approximation (LRA) method while minimizing the degradation in recognition accuracy. The memory footprint was further reduced with 8-bit quantization, bringing the final model size below 39 MB. For on-demand adaptation, we fused the MoChA models with statistical n-gram models and achieved a relative improvement of 36% on average in word error rate (WER) for target domains, including the general domain.
{"title":"Simple Gated Convnet for Small Footprint Acoustic Modeling","authors":"Lukas Lee, Jinhwan Park, Wonyong Sung","doi":"10.1109/ASRU46091.2019.9003993","DOIUrl":"https://doi.org/10.1109/ASRU46091.2019.9003993","url":null,"abstract":"Acoustic modeling with recurrent neural networks has shown very good performance, especially for end-to-end speech recognition. However, most recurrent neural networks require sequential computation of the output, which results in large memory access overhead when implemented in embedded devices. Convolution-based sequential modeling does not suffer from this problem; however, the model usually requires a large number of parameters. We propose simple gated convolutional neural networks (Simple Gated ConvNet) for acoustic modeling and show that the network performs very well even when the number of parameters is fairly small, less than 3 million. The Simple Gated ConvNet (SGCN) is constructed by combining the simplest form of Gated ConvNet and one-dimensional (1-D) depthwise convolution. The model has been evaluated using the Wall Street Journal (WSJ) Corpus and has shown a performance competitive to RNN-based ones. The performance of the SGCN has also been evaluated using the LibriSpeech Corpus. The developed model was implemented in ARM CPU based systems and showed the real time factor (RTF) of around 0.05.","PeriodicalId":150913,"journal":{"name":"2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"53 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115930436","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Title: Improving Speech-Based End-of-Turn Detection Via Cross-Modal Representation Learning with Punctuated Text Data
Authors: Ryo Masumura, Mana Ihori, Tomohiro Tanaka, Atsushi Ando, Ryo Ishii, T. Oba, Ryuichiro Higashinaka
Published in: 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), December 2019
DOI: 10.1109/ASRU46091.2019.9003816
Abstract: This paper presents a novel training method for speech-based end-of-turn detection that utilizes not only manually annotated speech data sets but also punctuated text data sets. Speech-based end-of-turn detection estimates from speech information whether a target speaker's utterance has ended. In previous studies, end-of-turn detection models were trained using only speech data sets containing manually annotated end-of-turn labels; since the amounts of annotated speech data are often limited, such models were unable to correctly handle a wide variety of speech patterns. To mitigate this data scarcity problem, our key idea is to leverage punctuated text data sets for building more effective speech-based end-of-turn detection. The proposed method therefore introduces cross-modal representation learning to construct a speech encoder and a text encoder that map speech and text carrying the same lexical information into similar vector representations. This enables us to train speech-based end-of-turn detection models from punctuated text data sets by tackling text-based sentence boundary detection. In experiments on contact center calls, we show that speech-based end-of-turn detection models using hierarchical recurrent neural networks can be improved through the use of punctuated text data sets.
{"title":"Speaker-Aware Speech-Transformer","authors":"Zhiyun Fan, Jie Li, Shiyu Zhou, Bo Xu","doi":"10.1109/ASRU46091.2019.9003844","DOIUrl":"https://doi.org/10.1109/ASRU46091.2019.9003844","url":null,"abstract":"Recently, end-to-end (E2E) models become a competitive alternative to the conventional hybrid automatic speech recognition (ASR) systems. However, they still suffer from speaker mismatch in training and testing condition. In this paper, we use Speech-Transformer (ST) as the study platform to investigate speaker aware training of E2E models. We propose a model called Speaker-Aware Speech-Transformer (SAST), which is a standard ST equipped with a speaker attention module (SAM). The SAM has a static speaker knowledge block (SKB) that is made of i-vectors. At each time step, the encoder output attends to the i-vectors in the block, and generates a weighted combined speaker embedding vector, which helps the model to normalize the speaker variations. The SAST model trained in this way becomes independent of specific training speakers and thus generalizes better to unseen testing speakers. We investigate different factors of SAM. Experimental results on the AISHELL-1 task show that SAST achieves a relative 6.5% CER reduction (CERR) over the speaker-independent (SI) baseline. Moreover, we demonstrate that SAST still works quite well even if the i-vectors in SKB all come from a different data source other than the acoustic training set.","PeriodicalId":150913,"journal":{"name":"2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129362252","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improving Grapheme-to-Phoneme Conversion by Investigating Copying Mechanism in Recurrent Architectures","authors":"Abhishek Niranjan, M. Shaik","doi":"10.1109/ASRU46091.2019.9003729","DOIUrl":"https://doi.org/10.1109/ASRU46091.2019.9003729","url":null,"abstract":"Attention driven encoder-decoder architectures have become highly successful in various sequence-to-sequence learning tasks. We propose copy-augmented Bi-directional Long Short-Term Memory based Encoder-Decoder architecture for the Grapheme-to-Phoneme conversion. In Grapheme-to-Phoneme task, a number of character units in words possess high degree of similarity with some phoneme unit(s). Thus, we make an attempt to capture this characteristic using copy-augmented architecture. Our proposed model automatically learns to generate phoneme sequences during inference by copying source token embeddings to the decoder's output in a controlled manner. To our knowledge, this is the first time the copy-augmentation is being investigated for Grapheme-to-Phoneme conversion task. We validate our experiments over accented and non-accented publicly available CMU-Dict datasets and achieve State-of-The-Art performances in terms of both phoneme and word error rates. Further, we verify the applicability of our proposed approach on Hindi Lexicon and show that our model outperforms all recent State-of-The-Art results.","PeriodicalId":150913,"journal":{"name":"2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130168869","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}