2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU): Latest Publications

End-to-end text-independent speaker verification with flexibility in utterance duration
2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) | Pub Date: 2017-12-01 | DOI: 10.1109/ASRU.2017.8268989
Chunlei Zhang, K. Koishida
Abstract: We continue to investigate end-to-end text-independent speaker verification by incorporating the variability from different utterance durations. Our previous study [1] showed competitive performance with a triplet-loss-based end-to-end text-independent speaker verification system. To normalize the duration variability, we provided fixed-length inputs to the network by a simple cropping or padding operation. Those operations are not ideal, particularly for long utterances, where some amount of information is discarded, whereas an i-vector system typically improves in accuracy as the input duration increases. In this study, we propose to replace the final max/average pooling layer in the Inception-ResNet-v1 architecture with a spatial pyramid pooling layer, which allows us to relax the fixed-length input constraint and train the entire network end-to-end on inputs of arbitrary size. In this way, the modified network maps variable-length utterances into fixed-length embeddings. Experiments show that the new end-to-end system with variable-size input reduces EER by 8.4% relative to the end-to-end system with fixed-length input, and by 24.0% relative to the i-vector/PLDA baseline system.
Citations: 14
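As a rough illustration of the pooling change described above (a minimal sketch, not the authors' code: the tensor layout, channel count, and pyramid levels are assumptions), spatial pyramid pooling turns a variable-length feature map into a fixed-length vector by pooling at several grid resolutions and concatenating the results:

    import torch
    import torch.nn.functional as F

    def spatial_pyramid_pool(feature_map, levels=(1, 2, 4)):
        """Pool a variable-size [batch, channels, time, freq] map into a fixed-length vector."""
        batch, channels = feature_map.shape[:2]
        pooled = []
        for bins in levels:
            # Adaptive pooling yields a bins x bins grid regardless of input size.
            p = F.adaptive_max_pool2d(feature_map, output_size=(bins, bins))
            pooled.append(p.reshape(batch, channels * bins * bins))
        return torch.cat(pooled, dim=1)  # dimension: channels * sum(b*b), independent of time

    # Two utterances of different duration map to embeddings of the same size.
    short_utt = torch.randn(1, 128, 50, 8)   # hypothetical feature-map shapes
    long_utt = torch.randn(1, 128, 400, 8)
    assert spatial_pyramid_pool(short_utt).shape == spatial_pyramid_pool(long_utt).shape

Because each pyramid level produces a fixed number of bins, the concatenated embedding has the same dimensionality for any utterance duration, which is what allows the network to be trained end-to-end on arbitrary-length input.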
Syllable-based acoustic modeling with CTC-SMBR-LSTM
2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) | Pub Date: 2017-12-01 | DOI: 10.1109/ASRU.2017.8268932
Zhongdi Qu, Parisa Haghani, Eugene Weinstein, P. Moreno
Abstract: We explore the feasibility of training long short-term memory (LSTM) recurrent neural networks (RNNs) with syllables, rather than phonemes, as outputs. Syllables are a natural choice of linguistic unit for modeling the acoustics of languages such as Mandarin Chinese, due to the inherent nature of the syllable as an elemental pronunciation construct and the limited size of the syllable set for such languages (around 1400 syllables for Mandarin). Our models are trained with connectionist temporal classification (CTC) and state-level minimum Bayes risk (sMBR) losses using asynchronous stochastic gradient descent (ASGD) on a parallel computation infrastructure for large-scale training. Our acoustic models operate on feature frames computed every 30 ms, which makes them well suited to modeling syllables rather than phonemes, which can have a shorter duration. Additionally, compared with word-level modeling, syllables have the advantage of avoiding out-of-vocabulary (OOV) model outputs. Our experiments on a Mandarin voice search task show that syllable-output models can perform better than context-independent (CI) phone-output models, and can give performance similar to our state-of-the-art context-dependent (CD) models. Additionally, decoding with syllable-output models is substantially faster than with CI or CD models. We demonstrate that these improvements are maintained when the model is trained to recognize both Mandarin syllables and English phonemes.
Citations: 24
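As a rough sketch of the CTC portion of such a system (the sMBR stage, ASGD, and the large-scale training infrastructure are omitted; the feature dimension, layer sizes, and exact syllable inventory are assumptions), a syllable-output LSTM can be trained with a standard CTC loss:

    import torch
    import torch.nn as nn

    NUM_SYLLABLES = 1400              # assumed Mandarin syllable inventory size
    NUM_OUTPUTS = NUM_SYLLABLES + 1   # +1 for the CTC blank symbol (index 0)

    class SyllableLSTM(nn.Module):
        def __init__(self, feat_dim=80, hidden=512):
            super().__init__()
            self.lstm = nn.LSTM(feat_dim, hidden, num_layers=3, batch_first=True)
            self.proj = nn.Linear(hidden, NUM_OUTPUTS)

        def forward(self, feats):                   # feats: [batch, frames, feat_dim]
            out, _ = self.lstm(feats)
            return self.proj(out).log_softmax(-1)   # [batch, frames, NUM_OUTPUTS]

    model = SyllableLSTM()
    ctc = nn.CTCLoss(blank=0, zero_infinity=True)

    feats = torch.randn(2, 300, 80)                  # two padded utterances
    feat_lens = torch.tensor([300, 240])
    targets = torch.randint(1, NUM_OUTPUTS, (2, 20)) # syllable label sequences
    target_lens = torch.tensor([20, 15])

    log_probs = model(feats).transpose(0, 1)         # CTCLoss expects [T, N, C]
    loss = ctc(log_probs, targets, feat_lens, target_lens)
    loss.backward()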
Multi-level language modeling and decoding for open vocabulary end-to-end speech recognition
2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) | Pub Date: 2017-12-01 | DOI: 10.1109/ASRU.2017.8268948
Takaaki Hori, Shinji Watanabe, J. Hershey
Abstract: We propose a combination of character-based and word-based language models in an end-to-end automatic speech recognition (ASR) architecture. In our prior work, we combined a character-based LSTM RNN-LM with a hybrid attention/connectionist temporal classification (CTC) architecture. The character LMs improved recognition accuracy to rival state-of-the-art DNN/HMM systems in Japanese and Mandarin Chinese tasks. Although a character-based architecture can provide open-vocabulary recognition, character-based LMs generally under-perform relative to word LMs for languages such as English with a small alphabet, because of the difficulty of modeling linguistic constraints across long sequences of characters. This paper presents a novel method for end-to-end ASR decoding with LMs at both the character and word level. Hypotheses are first scored with the character-based LM until a word boundary is encountered. Known words are then re-scored using the word-based LM, while the character-based LM provides out-of-vocabulary scores. On the standard Wall Street Journal (WSJ) task, we achieved 5.6% WER on the Eval'92 test set using only the SI284 training set and WSJ text data, which is the best score reported for end-to-end ASR systems on this benchmark.
Citations: 41
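The character/word scoring interplay can be sketched offline as follows (a simplified illustration of the idea only: the LM classes and their log_prob interface are hypothetical stand-ins, and in the paper the scores are applied incrementally during beam-search decoding rather than on a complete hypothesis):

    import math

    class UniformLM:
        """Hypothetical stand-in for a trained LM exposing log_prob(token, context)."""
        def __init__(self, size):
            self.logp = -math.log(size)

        def log_prob(self, token, context):
            return self.logp

    def score_hypothesis(chars, char_lm, word_lm, vocab, oov_penalty=-10.0):
        """Multi-level LM score for a space-delimited character hypothesis."""
        total, word, word_char_score = 0.0, "", 0.0
        history = []                                  # previously closed words
        for i, c in enumerate(chars):
            if c == " ":                              # word boundary: close the word
                if word in vocab:
                    # In-vocabulary: replace the character-level estimate with the word-LM score.
                    total += word_lm.log_prob(word, context=history)
                else:
                    # OOV: keep the character-level score, plus a penalty.
                    total += word_char_score + oov_penalty
                history.append(word)
                word, word_char_score = "", 0.0
            else:
                word += c
                word_char_score += char_lm.log_prob(c, context=chars[:i])
        return total

    vocab = {"the", "cat"}
    print(score_hypothesis("the cat sat ", UniformLM(30), UniformLM(20000), vocab))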
Multi-task ensembles with teacher-student training
2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) | Pub Date: 2017-12-01 | DOI: 10.1109/ASRU.2017.8268920
J. H. M. Wong, M. Gales
Abstract: Ensemble methods often yield significant gains for automatic speech recognition. One method to obtain a diverse ensemble is to separately train models with a range of context-dependent targets, often implemented as state clusters. However, decoding the complete ensemble can be computationally expensive. To reduce this cost, the ensemble can be generated using a multi-task architecture. Here, the hidden layers are merged across all members of the ensemble, leaving only separate output layers for each set of targets. Previous investigations of this form of ensemble have used cross-entropy training, which is shown in this paper to produce only limited diversity between members of the ensemble. This paper extends the multi-task framework in several ways. First, the multi-task ensemble can be trained in a teacher-student fashion toward the ensemble of separate models, with the aim of increasing diversity. Second, the multi-task ensemble can be trained with a sequence discriminative criterion. Finally, a student model with a single output layer can be trained to emulate the combined ensemble, to further reduce the computational cost of decoding. These methods are evaluated on the Babel conversational telephone speech, AMI meeting transcription, and HUB4 English broadcast news tasks.
Citations: 14
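A frame-level version of the teacher-student objective for a multi-task ensemble might look like the following sketch (my own simplification: one KL term per output head toward the corresponding separately trained teacher, with the sequence-discriminative extension omitted):

    import torch
    import torch.nn.functional as F

    def multitask_teacher_student_loss(student_logits, teacher_logits):
        """Average KL(teacher || student) over the output heads of the ensemble.

        Both arguments are lists with one [batch, num_targets_k] tensor per set
        of context-dependent targets (state clusters).
        """
        loss = 0.0
        for s, t in zip(student_logits, teacher_logits):
            log_q = F.log_softmax(s, dim=-1)        # student posteriors (log domain)
            p = F.softmax(t, dim=-1).detach()       # teacher posteriors, held fixed
            loss = loss + F.kl_div(log_q, p, reduction="batchmean")
        return loss / len(student_logits)

    # Two heads with differently sized state-cluster target sets (hypothetical sizes).
    student = [torch.randn(4, 3000, requires_grad=True),
               torch.randn(4, 2500, requires_grad=True)]
    teacher = [torch.randn(4, 3000), torch.randn(4, 2500)]
    multitask_teacher_student_loss(student, teacher).backward()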
Direct modeling of raw audio with DNNs for wake word detection
2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) | Pub Date: 2017-12-01 | DOI: 10.1109/ASRU.2017.8268943
K. Kumatani, S. Panchapagesan, Minhua Wu, Minjae Kim, N. Strom, Gautam Tiwari, Arindam Mandal
Abstract: In this work, we develop a technique for training features directly from the single-channel speech waveform in order to improve wake word (WW) detection performance. Conventional speech recognition systems typically extract a compact feature representation based on prior knowledge, such as log-mel filter bank energy (LFBE). Such a feature is then used for training a deep neural network (DNN) acoustic model (AM). In contrast, we directly train the WW DNN AM from the single-channel audio data in a stage-wise manner. We first build a feature-extraction DNN with a small hidden bottleneck layer, and train this bottleneck feature representation using the same multi-task cross-entropy objective function that we use to train our WW DNNs. Then, the WW classification DNN is trained on the bottleneck features, keeping the feature-extraction layers fixed. Finally, the feature-extraction and classification DNNs are combined and jointly optimized. We show the effectiveness of this stage-wise training technique through a set of experiments on real beam-formed far-field data. The results show that the audio-input DNN provides significantly lower miss rates over a range of false-alarm rates than the LFBE when a sufficient amount of training data is available, yielding approximately 12% relative improvement in the area under the curve (AUC).
Citations: 48
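The stage-wise recipe can be outlined as below (a hypothetical skeleton only: layer sizes, the waveform window length, optimizers, and a two-class output are assumptions; stage 1, which trains the bottleneck front end with the same cross-entropy objective, is indicated only in comments):

    import torch
    import torch.nn as nn

    feature_net = nn.Sequential(          # raw waveform window -> bottleneck feature
        nn.Linear(400, 512), nn.ReLU(),   # e.g. 25 ms of 16 kHz audio = 400 samples
        nn.Linear(512, 64), nn.ReLU(),    # small hidden bottleneck layer
    )
    classifier = nn.Sequential(           # bottleneck feature -> wake-word posterior
        nn.Linear(64, 128), nn.ReLU(),
        nn.Linear(128, 2),
    )
    # Stage 1 (not shown): train feature_net, with its bottleneck, using the same
    # cross-entropy objective as the wake-word classifier.

    # Stage 2: keep the feature extractor fixed and train only the classifier.
    for p in feature_net.parameters():
        p.requires_grad = False
    stage2_opt = torch.optim.Adam(classifier.parameters(), lr=1e-3)

    # Stage 3: un-freeze everything and jointly optimize the combined network.
    for p in feature_net.parameters():
        p.requires_grad = True
    joint_opt = torch.optim.Adam(
        list(feature_net.parameters()) + list(classifier.parameters()), lr=1e-4)

    windows = torch.randn(8, 400)                    # a batch of raw audio windows
    labels = torch.randint(0, 2, (8,))
    loss = nn.functional.cross_entropy(classifier(feature_net(windows)), labels)
    loss.backward()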
Seeing and hearing too: Audio representation for video captioning
2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) | Pub Date: 2017-12-01 | DOI: 10.1109/ASRU.2017.8268961
Shun-Po Chuang, Chia-Hung Wan, Pang-Chi Huang, Chi-Yu Yang, Hung-yi Lee
Abstract: Video captioning has been widely researched. Most related work takes into account only visual content when generating descriptions. However, auditory content such as human speech or environmental sounds contains rich information for describing scenes, yet it has not been widely explored for video captioning. Here, we experiment with different ways to use this auditory content in videos, and demonstrate improved caption generation in terms of popular evaluation metrics such as BLEU, CIDEr, and METEOR. We also measure the semantic similarities between generated captions and human-provided ground truth using sentence embeddings, and find that good use of multi-modal content helps the machine generate captions that are more semantically related to the ground truth. When analyzing the generated sentences, we find some ambiguous situations for which visual-only models yield incorrect results but which are resolved by approaches that take auditory cues into account.
Citations: 6
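One simple way to exploit auditory content alongside visual content, in the spirit of the abstract above, is late fusion of per-modality encodings before caption decoding; the sketch below is a hypothetical illustration (encoder choices, feature dimensions, and the conditioning scheme are assumptions, not the authors' architecture):

    import torch
    import torch.nn as nn

    class MultiModalCaptioner(nn.Module):
        def __init__(self, video_dim=2048, audio_dim=128, hidden=512, vocab=10000):
            super().__init__()
            self.video_enc = nn.GRU(video_dim, hidden, batch_first=True)
            self.audio_enc = nn.GRU(audio_dim, hidden, batch_first=True)
            self.fuse = nn.Linear(hidden * 2, hidden * 2)
            self.embed = nn.Embedding(vocab, hidden * 2)
            self.decoder = nn.GRU(hidden * 2, hidden, batch_first=True)
            self.out = nn.Linear(hidden, vocab)

        def forward(self, video_feats, audio_feats, caption_tokens):
            _, v = self.video_enc(video_feats)       # final states: [1, batch, hidden]
            _, a = self.audio_enc(audio_feats)
            fused = torch.tanh(self.fuse(torch.cat([v, a], dim=-1)))
            # Condition the decoder by adding the fused context to every word embedding.
            emb = self.embed(caption_tokens) + fused.transpose(0, 1)
            dec, _ = self.decoder(emb)
            return self.out(dec)                     # per-step vocabulary logits

    model = MultiModalCaptioner()
    video = torch.randn(2, 40, 2048)                 # CNN features for 40 video frames
    audio = torch.randn(2, 100, 128)                 # 100 frames of audio features
    tokens = torch.randint(0, 10000, (2, 12))        # caption word indices
    logits = model(video, audio, tokens)             # [2, 12, 10000]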
Improving the efficiency of forward-backward algorithm using batched computation in TensorFlow
2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) | Pub Date: 2017-12-01 | DOI: 10.1109/ASRU.2017.8268944
K. Sim, A. Narayanan, Tom Bagby, Tara N. Sainath, M. Bacchiani
Abstract: Sequence-level losses are commonly used to train deep neural network acoustic models for automatic speech recognition. The forward-backward algorithm is used to efficiently compute the gradients of the sequence loss with respect to the model parameters, and gradient-based optimization is used to minimize these losses. Recent work has shown that the forward-backward algorithm can be efficiently implemented as a series of matrix operations. This paper further improves the forward-backward algorithm via batched computation, a technique commonly used to improve training speed by exploiting the parallel computation of matrix multiplication. Specifically, we show how batched computation of the forward-backward algorithm can be efficiently implemented in TensorFlow to handle variable-length sequences within a mini-batch. Furthermore, we show how the batched forward-backward computation can be used to compute the gradients of the connectionist temporal classification (CTC) and maximum mutual information (MMI) losses with respect to the logits. We show, via empirical benchmarks, that the batched forward-backward computation can speed up the CTC loss and gradient computation by about 183 times when run on a GPU with a batch size of 256 compared to a batch size of 1, and by about 22 times for lattice-free MMI using a trigram phone language model for the denominator.
Citations: 11
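The core idea, batching the recursion across sequences and masking padded frames so that one matrix operation per time step serves the whole mini-batch, can be illustrated with a plain NumPy forward pass (this is an illustration only, not the paper's TensorFlow implementation, and it shows just the forward half of the algorithm):

    import numpy as np

    def batched_forward(log_obs, log_trans, log_prior, lengths):
        """Batched HMM forward pass in the log domain.

        log_obs: [batch, frames, states] per-frame observation log-likelihoods (padded),
        log_trans: [states, states], log_prior: [states], lengths: true frame counts.
        """
        batch, frames, states = log_obs.shape
        alpha = log_prior[None, :] + log_obs[:, 0, :]            # [batch, states]
        for t in range(1, frames):
            # One batched operation per frame: logsumexp over previous states.
            scores = alpha[:, :, None] + log_trans[None, :, :]   # [batch, prev, next]
            new_alpha = np.logaddexp.reduce(scores, axis=1) + log_obs[:, t, :]
            active = (t < lengths)[:, None]                      # mask padded frames
            alpha = np.where(active, new_alpha, alpha)
        return np.logaddexp.reduce(alpha, axis=1)                # log p(x) per sequence

    rng = np.random.default_rng(0)
    log_obs = np.log(rng.random((3, 7, 4)))      # 3 sequences, padded to 7 frames, 4 states
    log_trans = np.log(np.full((4, 4), 0.25))    # uniform transition matrix
    log_prior = np.log(np.full(4, 0.25))
    print(batched_forward(log_obs, log_trans, log_prior, np.array([7, 5, 3])))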
Leveraging side information for speaker identification with the Enron conversational telephone speech collection
2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) | Pub Date: 2017-12-01 | DOI: 10.1109/ASRU.2017.8268988
Ning Gao, Gregory Sell, Douglas W. Oard, Mark Dredze
Abstract: Speaker identification experiments typically focus on acoustic signals, but conversational speech often occurs in settings where additional useful side information may be available. This paper introduces a new distributable speaker identification test collection based on recorded telephone calls of Enron energy traders. Experiments with these recordings demonstrate that social network features and recording channel metadata can be used to reduce error rates in speaker identification below those achieved using acoustic evidence alone. Social network features from the parallel Enron email collection (37 of the 41 speakers in the telephone recordings sent or received emails in the collection) improve speaker identification, as do social network features computed using lightly supervised techniques to estimate a social network from more than one thousand unlabeled recordings.
Citations: 4
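A minimal sketch of score-level fusion of acoustic and side information follows (entirely illustrative: the weights, the log-prior form, and the toy numbers are assumptions, not the paper's model):

    import numpy as np

    def fuse_scores(acoustic_llk, social_prior, channel_prior, w=(1.0, 0.5, 0.3)):
        """Each argument is a [num_speakers] array of scores for one test call."""
        combined = (w[0] * acoustic_llk
                    + w[1] * np.log(social_prior + 1e-10)
                    + w[2] * np.log(channel_prior + 1e-10))
        return int(np.argmax(combined))            # index of the predicted speaker

    # Toy three-speaker example: the acoustic model slightly prefers speaker 0,
    # but the side information tips the decision toward speaker 1.
    acoustic = np.array([-10.0, -10.5, -14.0])     # acoustic log-likelihoods
    social = np.array([0.05, 0.80, 0.15])          # e.g. email-graph connectivity
    channel = np.array([0.30, 0.50, 0.20])         # e.g. recording-channel metadata
    print(fuse_scores(acoustic, social, channel))  # -> 1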
The zero resource speech challenge 2017
2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) | Pub Date: 2017-12-01 | DOI: 10.1109/ASRU.2017.8268953
Maarten Versteegh, Roland Thiollière, Thomas Schatz, Xuan-Nga Cao, Xavier Anguera Miró, A. Jansen, Emmanuel Dupoux
Abstract: We describe a new challenge aimed at discovering subword and word units from raw speech. This challenge is the follow-up to the Zero Resource Speech Challenge 2015. It aims at constructing systems that generalize across languages and adapt to new speakers. The design features and evaluation metrics of the challenge are presented and the results of seventeen models are discussed.
Citations: 221
MGB-3 BUT system: Low-resource ASR on Egyptian YouTube data
2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) | Pub Date: 2017-12-01 | DOI: 10.1109/ASRU.2017.8268959
Karel Veselý, M. Baskar, M. Díez, Karel Beneš
Abstract: This paper presents a series of experiments we performed during our work on the MGB-3 evaluations. We describe both the submitted system and the post-evaluation analysis. Our initial BLSTM-HMM system was trained on 250 hours of MGB-2 data (Al-Jazeera) and adapted with 5 hours of Egyptian data (YouTube). We included techniques such as diarization, n-gram language model adaptation, speed perturbation of the adaptation data, and the use of all four 'correct' references. The four references were either used for supervision with a 'confusion network', or each sentence was included four times with the transcripts from all the annotators. It was also helpful to blend the augmented MGB-3 adaptation data with 15 hours of MGB-2 data. Although our single system did not rank among the best teams in the evaluations, we believe that our analysis will be of interest to the other MGB-3 challenge participants and beyond.
Citations: 4
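Speed perturbation of the adaptation data, one of the techniques listed above, can be sketched with simple resampling (an assumption-laden illustration, not the BUT pipeline: the 0.9x/1.1x factors and the polyphase resampling choice are common defaults, and pitch shifts along with speed):

    import numpy as np
    from scipy.signal import resample_poly

    def speed_perturb(waveform):
        """Return the original waveform plus 0.9x and 1.1x speed copies."""
        copies = {1.0: waveform}
        # Resampling by 10/9 stretches the signal (0.9x speed, longer copy);
        # 10/11 compresses it (1.1x speed, shorter copy).
        copies[0.9] = resample_poly(waveform, up=10, down=9)
        copies[1.1] = resample_poly(waveform, up=10, down=11)
        return copies

    audio = np.random.randn(16000)                 # one second of fake 16 kHz audio
    for factor, wav in speed_perturb(audio).items():
        print(factor, len(wav))                    # 0.9x copy is longer, 1.1x shorter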