{"title":"End-to-end text-independent speaker verification with flexibility in utterance duration","authors":"Chunlei Zhang, K. Koishida","doi":"10.1109/ASRU.2017.8268989","DOIUrl":"https://doi.org/10.1109/ASRU.2017.8268989","url":null,"abstract":"We continue to investigate end-to-end text-independent speaker verification by incorporating the variability from different utterance durations. Our previous study [1] showed a competitive performance with a triplet loss based end-to-end text-independent speaker verification system. To normalize the duration variability, we provided fixed length inputs to the network by a simple cropping or padding operation. Those operations do not seem ideal, particularly for long duration where some amount of information is discarded, while an i-vector system typically has improved accuracy with an increase in input duration. In this study, we propose to replace the final max/average pooling layer with a Spatial Pyramid Pooling layer in the Inception-Resnet-v1 architecture, which allows us to relax the fixed-length input constraint and train the entire network with the arbitrary size of input in an end-to-end fashion. In this way, the modified network can map variable length utterances into fixed length embeddings. Experiments shows that the new end-to-end system with variable size input relatively reduces EER by 8.4% over the end-to-end system with fixed-length input, and 24.0% over the i-vector/PLDA baseline system. an end-to-end system with.","PeriodicalId":290868,"journal":{"name":"2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133491858","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Syllable-based acoustic modeling with CTC-SMBR-LSTM","authors":"Zhongdi Qu, Parisa Haghani, Eugene Weinstein, P. Moreno","doi":"10.1109/ASRU.2017.8268932","DOIUrl":"https://doi.org/10.1109/ASRU.2017.8268932","url":null,"abstract":"We explore the feasibility of training long short-term memory (LSTM) recurrent neural networks (RNNs) with syllables, rather than phonemes, as outputs. Syllables are a natural choice of linguistic unit for modeling the acoustics of languages such as Mandarin Chinese, due to the inherent nature of the syllable as an elemental pronunciation construct and the limited size of the syllable set for such languages (around 1400 syllables for Mandarin). Our models are trained with Connectionist Temporal Classification (CTC) and state-level minimum Bayes risk (sMBR) loss using asynchronous stochastic gradient descent (ASGD) utilizing a parallel computation infrastructure for large-scale training. Our acoustic models operate on feature frames computed every 30ms, which makes them well suited for modeling syllables rather than phonemes, which can have a shorter duration. Additionally, when compared to wordlevel modeling, syllables have the advantage of avoiding out-of-vocabulary (OOV) model outputs. Our experiments on a Mandarin voice search task show that syllable-output models can perform better than context-independent (CI) phone-output models, and can give similar performance as our state-of-the-art context-dependent (CD) models. Additionally, decoding with syllable-output models is substantially faster than with CI models or with CD models. We demonstrate that these improvements are maintained when the model is trained to recognize both Mandarin syllables and English phonemes.","PeriodicalId":290868,"journal":{"name":"2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131487068","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multi-level language modeling and decoding for open vocabulary end-to-end speech recognition","authors":"Takaaki Hori, Shinji Watanabe, J. Hershey","doi":"10.1109/ASRU.2017.8268948","DOIUrl":"https://doi.org/10.1109/ASRU.2017.8268948","url":null,"abstract":"We propose a combination of character-based and word-based language models in an end-to-end automatic speech recognition (ASR) architecture. In our prior work, we combined a character-based LSTM RNN-LM with a hybrid attention/connectionist temporal classification (CTC) architecture. The character LMs improved recognition accuracy to rival state-of-the-art DNN/HMM systems in Japanese and Mandarin Chinese tasks. Although a character-based architecture can provide for open vocabulary recognition, the character-based LMs generally under-perform relative to word LMs for languages such as English with a small alphabet, because of the difficulty of modeling Linguistic constraints across long sequences of characters. This paper presents a novel method for end-to-end ASR decoding with LMs at both the character and word level. Hypotheses are first scored with the character-based LM until a word boundary is encountered. Known words are then re-scored using the word-based LM, while the character-based LM provides for out-of-vocabulary scores. In a standard Wall Street Journal (WSJ) task, we achieved 5.6 % WER for the Eval'92 test set using only the SI284 training set and WSJ text data, which is the best score reported for end-to-end ASR systems on this benchmark.","PeriodicalId":290868,"journal":{"name":"2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"51 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131600503","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multi-task ensembles with teacher-student training","authors":"J. H. M. Wong, M. Gales","doi":"10.1109/ASRU.2017.8268920","DOIUrl":"https://doi.org/10.1109/ASRU.2017.8268920","url":null,"abstract":"Ensemble methods often yield significant gains for automatic speech recognition. One method to obtain a diverse ensemble is to separately train models with a range of context dependent targets, often implemented as state clusters. However, decoding the complete ensemble can be computationally expensive. To reduce this cost, the ensemble can be generated using a multi-task architecture. Here, the hidden layers are merged across all members of the ensemble, leaving only separate output layers for each set of targets. Previous investigations of this form of ensemble have used cross-entropy training, which is shown in this paper to produce only limited diversity between members of the ensemble. This paper extends the multi-task framework in several ways. First, the multi-task ensemble can be trained in a teacher-student fashion toward the ensemble of separate models, with the aim of increasing diversity. Second, the multi-task ensemble can be trained with a sequence discriminative criterion. Finally, a student model, with a single output layer, can be trained to emulate the combined ensemble, to further reduce the computational cost of decoding. These methods are evaluated on the Babel conversational telephone speech, AMI meeting transcription, and HUB4 English broadcast news tasks.","PeriodicalId":290868,"journal":{"name":"2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"60 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131091213","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Direct modeling of raw audio with DNNS for wake word detection","authors":"K. Kumatani, S. Panchapagesan, Minhua Wu, Minjae Kim, N. Strom, Gautam Tiwari, Arindam Mandal","doi":"10.1109/ASRU.2017.8268943","DOIUrl":"https://doi.org/10.1109/ASRU.2017.8268943","url":null,"abstract":"In this work, we develop a technique for training features directly from the single-channel speech waveform in order to improve wake word (WW) detection performance. Conventional speech recognition systems typically extract a compact feature representation based on prior knowledge such as log-mel filter bank energy (LFBE). Such a feature is then used for training a deep neural network (DNN) acoustic model (AM). In contrast, we directly train the WW DNN AM from the single-channel audio data in a stage-wise manner. We first build a feature extraction DNN with a small hidden bottleneck layer, and train this bottleneck feature representation using the same multi-task cross-entropy objective function as we use to train our WW DNNs. Then, the WW classification DNN is trained with input bottleneck features, keeping the feature extraction layers fixed. Finally, the feature extraction and classification DNNs are combined and then jointly optimized. We show the effectiveness of this stage-wise training technique through a set of experiments on real beam-formed far-field data. The experiment results show that the audioinput DNN provides significantly lower miss rates for a range of false alarm rates over the LFBE when a sufficient amount of training data is available, yielding approximately 12 % relative improvement in the area under the curve (AUC).","PeriodicalId":290868,"journal":{"name":"2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115458849","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Seeing and hearing too: Audio representation for video captioning","authors":"Shun-Po Chuang, Chia-Hung Wan, Pang-Chi Huang, Chi-Yu Yang, Hung-yi Lee","doi":"10.1109/ASRU.2017.8268961","DOIUrl":"https://doi.org/10.1109/ASRU.2017.8268961","url":null,"abstract":"Video captioning has been widely researched. Most related work takes into account only visual content in generating descriptions. However, auditory content such as human speech or environmental sounds contains rich information for describing scenes, but has yet to be widely explored for video captions. Here, we experiment with different ways to use this auditory content in videos, and demonstrate improved caption generation in terms of popular evaluation methods such as BLEU, CIDEr, and METEOR. We also measure the semantic similarities between generated captions and human-provided ground truth using sentence embeddings, and find that good use of multi-modal contents helps the machine to generate captions that are more semantically related to the ground truth. When analyzing the generated sentences, we find some ambiguous situations for which visual-only models yield incorrect results but that are resolved by approaches that take into account auditory cues.","PeriodicalId":290868,"journal":{"name":"2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115507059","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improving the efficiency of forward-backward algorithm using batched computation in TensorFlow","authors":"K. Sim, A. Narayanan, Tom Bagby, Tara N. Sainath, M. Bacchiani","doi":"10.1109/ASRU.2017.8268944","DOIUrl":"https://doi.org/10.1109/ASRU.2017.8268944","url":null,"abstract":"Sequence-level losses are commonly used to train deep neural network acoustic models for automatic speech recognition. The forward-backward algorithm is used to efficiently compute the gradients of the sequence loss with respect to the model parameters. Gradient-based optimization is used to minimize these losses. Recent work has shown that the forward-backward algorithm can be efficiently implemented as a series of matrix operations. This paper further improves the forward-backward algorithm via batched computation, a technique commonly used to improve training speed by exploiting the parallel computation of matrix multiplication. Specifically, we show how batched computation of the forward-backward algorithm can be efficiently implemented using TensorFlow to handle variable-length sequences within a mini batch. Furthermore, we also show how the batched forward-backward computation can be used to compute the gradients of the connectionist temporal classification (CTC) and maximum mutual information (MMI) losses with respect to the logits. We show, via empirical benchmarks, that the batched forward-backward computation can speed up the CTC loss and gradient computation by about 183 times when run on GPU with a batch size of 256 compared to using a batch size of 1; and by about 22 times for lattice-free MMI using a trigram phone language model for the denominator.","PeriodicalId":290868,"journal":{"name":"2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"181 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124318563","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Leveraging side information for speaker identification with the Enron conversational telephone speech collection","authors":"Ning Gao, Gregory Sell, Douglas W. Oard, Mark Dredze","doi":"10.1109/ASRU.2017.8268988","DOIUrl":"https://doi.org/10.1109/ASRU.2017.8268988","url":null,"abstract":"Speaker identification experiments typically focus on acoustic signals, but conversational speech often occurs in settings where additional useful side information may be available. This paper introduces a new distributable speaker identification test collection based on recorded telephone calls of Enron energy traders. Experiments with these recordings demonstrate that social network features and recording channel metadata can be used to reduce error rates in speaker identification below that achieved using acoustic evidence alone. Social network features from the parallel Enron email collection (37 of the 41 speakers in the telephone recordings sent or received emails in the collection) improve speaker identification, as do social network features computed using lightly supervised techniques to estimate a social network from more than one thousand unlabeled recordings.","PeriodicalId":290868,"journal":{"name":"2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121328488","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The zero resource speech challenge 2017","authors":"Maarten Versteegh, Roland Thiollière, Thomas Schatz, Xuan-Nga Cao, Xavier Anguera Miró, A. Jansen, Emmanuel Dupoux","doi":"10.1109/ASRU.2017.8268953","DOIUrl":"https://doi.org/10.1109/ASRU.2017.8268953","url":null,"abstract":"We describe a new challenge aimed at discovering subword and word units from raw speech. This challenge is the followup to the Zero Resource Speech Challenge 2015. It aims at constructing systems that generalize across languages and adapt to new speakers. The design features and evaluation metrics of the challenge are presented and the results of seventeen models are discussed.","PeriodicalId":290868,"journal":{"name":"2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124875151","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MGB-3 but system: Low-resource ASR on Egyptian YouTube data","authors":"Karel Veselý, M. Baskar, M. Díez, Karel Beneš","doi":"10.1109/ASRU.2017.8268959","DOIUrl":"https://doi.org/10.1109/ASRU.2017.8268959","url":null,"abstract":"This paper presents a series of experiments we performed during our work on the MGB-3 evaluations. We both describe the submitted system, as well as the post-evaluation analysis. Our initial BLSTM-HMM system was trained on 250 hours of MGB-2 data (Al-Jazeera), it was adapted with 5 hours of Egyptian data (YouTube). We included such techniques as diarization, n-gram language model adaptation, speed perturbation of the adaptation data, and the use of all 4 ‘correct’ references. The 4 references were either used for supervision with a ‘confusion network’, or we included each sentence 4x with the transcripts from all the annotators. Then, it was also helpful to blend the augmented MGB-3 adaptation data with 15 hours of MGB-2 data. Although we did not rank with our single system among the best teams in the evaluations, we believe that our analysis will be highly interesting not only for the other MGB-3 challenge participants.","PeriodicalId":290868,"journal":{"name":"2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"112 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125116518","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}