2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU): Latest Publications

Hierarchical recurrent neural network for story segmentation using fusion of lexical and acoustic features
2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) Pub Date: 2017-12-01 DOI: 10.1109/ASRU.2017.8268981
E. Tsunoo, Ondrej Klejch, P. Bell, S. Renals
Abstract: A broadcast news stream consists of a number of stories, and finding story boundaries automatically is an important task in news analysis. We capture the topic structure using a hierarchical model based on a Recurrent Neural Network (RNN) sentence modeling layer and a bidirectional Long Short-Term Memory (LSTM) topic modeling layer, with a fusion of acoustic and lexical features. Both feature streams are accumulated with RNNs and trained jointly within the model so that they are fused at the sentence level. We conduct experiments on the topic detection and tracking (TDT4) task, comparing combinations of the two modalities trained with a limited amount of parallel data. We further utilize additional text data for training to refine the model. Experimental results indicate that the hierarchical RNN topic model benefits from the fusion scheme, especially with additional text training data, achieving a higher F1-measure than conventional state-of-the-art methods.
Citations: 8
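
The hierarchy described in this abstract (per-sentence RNN encoders for the lexical and acoustic streams, fused into a sentence vector, followed by a bidirectional LSTM over sentences) can be illustrated in a few lines of PyTorch. This is not the authors' implementation; the layer types, dimensions, and simple concatenation-based fusion are assumptions made for illustration.

```python
# Minimal PyTorch sketch of a hierarchical RNN for story segmentation with
# sentence-level fusion of lexical and acoustic features. Dimensions and the
# use of GRUs for the sentence layer are illustrative assumptions, not the
# authors' exact configuration.
import torch
import torch.nn as nn

class HierarchicalSegmenter(nn.Module):
    def __init__(self, vocab_size=10000, word_dim=128, acoustic_dim=40,
                 sent_dim=128, topic_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, word_dim)
        # Sentence-level encoders: one RNN per modality.
        self.lex_rnn = nn.GRU(word_dim, sent_dim, batch_first=True)
        self.ac_rnn = nn.GRU(acoustic_dim, sent_dim, batch_first=True)
        # Topic-level bidirectional LSTM over the fused sentence vectors.
        self.topic_lstm = nn.LSTM(2 * sent_dim, topic_dim, batch_first=True,
                                  bidirectional=True)
        self.boundary = nn.Linear(2 * topic_dim, 1)

    def forward(self, word_ids, acoustic_frames):
        # word_ids:        (num_sentences, max_words)      int64
        # acoustic_frames: (num_sentences, max_frames, 40)  float32
        _, h_lex = self.lex_rnn(self.embed(word_ids))         # (1, S, sent_dim)
        _, h_ac = self.ac_rnn(acoustic_frames)                # (1, S, sent_dim)
        sent_vec = torch.cat([h_lex[-1], h_ac[-1]], dim=-1)   # fuse at sentence level
        topic_out, _ = self.topic_lstm(sent_vec.unsqueeze(0)) # (1, S, 2*topic_dim)
        return torch.sigmoid(self.boundary(topic_out)).squeeze(-1)

model = HierarchicalSegmenter()
probs = model(torch.randint(0, 10000, (12, 20)), torch.randn(12, 30, 40))
print(probs.shape)  # (1, 12): one story-boundary probability per sentence
```
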
Leveraging native language speech for accent identification using deep Siamese networks
2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) Pub Date: 2017-12-01 DOI: 10.1109/ASRU.2017.8268994
Aditya Siddhant, P. Jyothi, Sriram Ganapathy
Abstract: The problem of automatic accent identification is important for several applications such as speaker profiling and recognition, as well as for improving speech recognition systems. The accented nature of speech can be primarily attributed to the influence of the speaker's native language on the given speech recording. In this paper, we propose a novel accent identification system whose training exploits speech in native languages along with the accented speech. Specifically, we develop a deep Siamese network based model which learns the association between accented speech recordings and native language speech recordings. The Siamese networks are trained with i-vector features extracted from the speech recordings using either an unsupervised Gaussian mixture model (GMM) or a supervised deep neural network (DNN) model. We perform several accent identification experiments using the CSLU Foreign Accented English (FAE) corpus. In these experiments, our proposed approach using deep Siamese networks yields a significant relative performance improvement of 15.4% on a 10-class accent identification task over a baseline DNN-based classification system that uses GMM i-vectors. Furthermore, we present a detailed error analysis of the proposed accent identification system.
Citations: 7
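
A minimal sketch of the core idea, a Siamese network over i-vector pairs trained with a contrastive objective, is shown below. The i-vector dimensionality, network depth, margin, and loss form are assumptions; the paper's exact pairing and training setup may differ.

```python
# Sketch of a Siamese network over i-vectors: the accented-English i-vector
# and a native-language i-vector pass through shared layers, and a contrastive
# loss pulls matching accent/native pairs together. Dimensions, depth, and the
# margin are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseIVector(nn.Module):
    def __init__(self, ivector_dim=400, hidden_dim=256, embed_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(ivector_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, embed_dim),
        )

    def forward(self, accented_iv, native_iv):
        # Shared weights applied to both branches.
        return self.net(accented_iv), self.net(native_iv)

def contrastive_loss(emb_a, emb_b, same_accent, margin=1.0):
    # same_accent = 1 if the native recording matches the speaker's L1, else 0.
    dist = F.pairwise_distance(emb_a, emb_b)
    pos = same_accent * dist.pow(2)
    neg = (1 - same_accent) * F.relu(margin - dist).pow(2)
    return (pos + neg).mean()

model = SiameseIVector()
a, b = model(torch.randn(8, 400), torch.randn(8, 400))
loss = contrastive_loss(a, b, torch.randint(0, 2, (8,)).float())
loss.backward()
```
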
Turbo fusion of magnitude and phase information for DNN-based phoneme recognition
2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) Pub Date: 2017-12-01 DOI: 10.1109/ASRU.2017.8268925
Timo Lohrenz, T. Fingscheidt
Abstract: In this work we propose so-called turbo fusion as a competitive method for fusing Mel-filterbank magnitude and phase feature streams in automatic speech recognition (ASR). Based on the recently introduced turbo ASR paradigm, our contribution is fourfold: First, we introduce DNN-based acoustic modeling into turbo ASR; then we take steps towards LVCSR by omitting the costly state space transform and by investigating the classical TIMIT phoneme recognition task. Finally, replacing the typical stream weighting of other fusion methods, we introduce a new dynamic range limitation of the posteriors exchanged between the magnitude and phase recognizers, resulting in a smoother information exchange. The proposed turbo fusion outperforms classical benchmarks on the TIMIT dataset both with and without dropout in DNN training, and also ranks first when compared to several state-of-the-art reference fusion methods.
Citations: 6
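
The abstract does not spell out how the dynamic range of the exchanged posteriors is limited, so the following is only a generic sketch of the idea: clip the log-posteriors of one stream to a bounded range and renormalize before passing them to the other recognizer. The clipping scheme and the bound are assumptions, not the paper's formulation.

```python
# Generic sketch of limiting the dynamic range of a posterior vector before it
# is exchanged between two recognizers (here: magnitude and phase streams).
# The log-domain flooring scheme and the bound are assumptions.
import numpy as np

def limit_dynamic_range(posteriors, max_log_range=5.0):
    """Floor log-posteriors so that max - min <= max_log_range, then renormalize."""
    log_p = np.log(np.maximum(posteriors, 1e-12))
    floor = log_p.max() - max_log_range
    log_p = np.maximum(log_p, floor)
    p = np.exp(log_p)
    return p / p.sum()

# A very peaked phoneme posterior becomes smoother before being passed to the
# other stream, which keeps the iterative exchange from locking in early errors.
peaked = np.array([0.97, 0.01, 0.01, 0.01])
print(limit_dynamic_range(peaked, max_log_range=2.0))
```
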
Semi-supervised training strategies for deep neural networks
2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) Pub Date: 2017-12-01 DOI: 10.1109/ASRU.2017.8268919
Matthew Gibson, G. Cook, P. Zhan
Abstract: Use of both manually and automatically labelled data for model training is referred to as semi-supervised training. While semi-supervised acoustic model training has been well-explored in the context of hidden Markov Gaussian mixture models (HMM-GMMs), the re-emergence of deep neural network (DNN) acoustic models has given rise to some novel approaches to semi-supervised DNN training. This paper investigates several different strategies for semi-supervised DNN training, including the so-called ‘shared hidden layer’ approach and the ‘knowledge distillation’ (or student-teacher) approach. Particular attention is paid to the differing behaviour of semi-supervised DNN training methods during the cross-entropy and sequence training phases of model building. Experimental results on our internal study dataset provide evidence that in a low-resource scenario the most effective semi-supervised training strategy is ‘naive CE’ (treating manually transcribed and automatically transcribed data identically during the cross-entropy phase of training) followed by use of a shared hidden layer technique during sequence training.
Citations: 2
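
A minimal sketch of the ‘naive CE’ strategy follows: frames with manual transcriptions and frames with automatically generated (first-pass) labels are pooled into one training set and treated identically during cross-entropy training. The toy model, feature dimension, and senone count are placeholder assumptions.

```python
# Sketch of 'naive CE': manually transcribed and automatically transcribed
# (hypothesis-labelled) frames are pooled and treated identically during the
# cross-entropy phase. Model size, feature dimension, and the number of senone
# targets are placeholder assumptions.
import torch
import torch.nn as nn
from torch.utils.data import TensorDataset, ConcatDataset, DataLoader

num_senones, feat_dim = 2000, 440                 # assumed spliced-feature input
manual = TensorDataset(torch.randn(1000, feat_dim),
                       torch.randint(0, num_senones, (1000,)))
auto = TensorDataset(torch.randn(5000, feat_dim),               # labels taken from
                     torch.randint(0, num_senones, (5000,)))    # a first-pass decode

model = nn.Sequential(nn.Linear(feat_dim, 1024), nn.ReLU(),
                      nn.Linear(1024, num_senones))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()

# Naive CE: one pooled loader, no per-source weighting or confidence filtering.
loader = DataLoader(ConcatDataset([manual, auto]), batch_size=256, shuffle=True)
for feats, targets in loader:
    optimizer.zero_grad()
    criterion(model(feats), targets).backward()
    optimizer.step()
```
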
Spoofing detection via simultaneous verification of audio-visual synchronicity and transcription
2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) Pub Date: 2017-12-01 DOI: 10.1109/ASRU.2017.8268990
Lea Schönherr, Steffen Zeiler, D. Kolossa
Abstract: Acoustic speaker recognition systems are very vulnerable to spoofing attacks via replayed or synthesized utterances. One possible countermeasure is audio-visual speaker recognition. Nevertheless, the addition of the visual stream alone does not prevent spoofing attacks completely and only provides further information to assess the authenticity of the utterance. Many systems consider audio and video modalities independently and can easily be spoofed by imitating only a single modality or by a bimodal replay attack with a victim's photograph or video. Therefore, we propose the simultaneous verification of the data synchronicity and the transcription in a challenge-response setup. We use coupled hidden Markov models (CHMMs) for a text-dependent spoofing detection and introduce new features that provide information about the transcriptions of the utterance and the synchronicity of both streams. We evaluate the features for various spoofing scenarios and show that the combination of the features leads to a more robust recognition, also in comparison to the baseline method. Additionally, by evaluating the data on unseen speakers, we show the spoofing detection to be applicable in speaker-independent use-cases.
Citations: 1
Exploring architectures, data and units for streaming end-to-end speech recognition with RNN-transducer
2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) Pub Date: 2017-12-01 DOI: 10.1109/ASRU.2017.8268935
Kanishka Rao, H. Sak, Rohit Prabhavalkar
Abstract: We investigate training end-to-end speech recognition models with the recurrent neural network transducer (RNN-T): a streaming, all-neural, sequence-to-sequence architecture which jointly learns acoustic and language model components from transcribed acoustic data. We explore various model architectures and demonstrate how the model can be improved further if additional text or pronunciation data are available. The model consists of an ‘encoder’, which is initialized from a connectionist temporal classification-based (CTC) acoustic model, and a ‘decoder’ which is partially initialized from a recurrent neural network language model trained on text data alone. The entire neural network is trained with the RNN-T loss and directly outputs the recognized transcript as a sequence of graphemes, thus performing end-to-end speech recognition. We find that performance can be improved further through the use of sub-word units ('wordpieces'), which capture longer context and significantly reduce substitution errors. The best RNN-T system, a twelve-layer LSTM encoder with a two-layer LSTM decoder trained with 30,000 wordpieces as output targets, achieves a word error rate of 8.5% on voice-search and 5.2% on voice-dictation tasks, comparable to a state-of-the-art baseline at 8.3% on voice-search and 5.4% on voice-dictation.
Citations: 313
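
The RNN-T structure described here (an encoder, a label-history prediction network, and a joint network whose logits feed the RNN-T loss) can be sketched compactly in PyTorch. The layer sizes, the tiny vocabulary, and the use of torchaudio's rnnt_loss (available in recent torchaudio releases) are assumptions; the paper's twelve-layer encoder, 30k-wordpiece inventory, and training pipeline are not reproduced here.

```python
# Minimal RNN-T sketch: audio encoder + prediction network + joint network,
# with logits of shape (batch, T, U+1, classes) fed to an RNN-T loss.
# All sizes are illustrative assumptions.
import torch
import torch.nn as nn
import torchaudio

class SmallRNNT(nn.Module):
    def __init__(self, feat_dim=80, vocab=100, enc_dim=320, pred_dim=320, joint_dim=320):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, enc_dim, num_layers=2, batch_first=True)
        self.embed = nn.Embedding(vocab + 1, pred_dim)             # +1 for the blank symbol
        self.predictor = nn.LSTM(pred_dim, pred_dim, batch_first=True)
        self.joint = nn.Sequential(nn.Linear(enc_dim + pred_dim, joint_dim),
                                   nn.Tanh(), nn.Linear(joint_dim, vocab + 1))

    def forward(self, feats, labels):
        enc, _ = self.encoder(feats)                       # (B, T, enc_dim)
        # Prepend blank so the predictor produces U+1 label-history states.
        start = torch.zeros_like(labels[:, :1])
        pred, _ = self.predictor(self.embed(torch.cat([start, labels], dim=1)))
        # Combine every (t, u) pair, then project to the output vocabulary.
        joint_in = torch.cat([enc.unsqueeze(2).expand(-1, -1, pred.size(1), -1),
                              pred.unsqueeze(1).expand(-1, enc.size(1), -1, -1)], dim=-1)
        return self.joint(joint_in)                        # (B, T, U+1, vocab+1)

model = SmallRNNT()
feats, labels = torch.randn(2, 50, 80), torch.randint(1, 101, (2, 10))
logits = model(feats, labels)
loss = torchaudio.functional.rnnt_loss(
    logits, labels.int(),
    logit_lengths=torch.full((2,), 50, dtype=torch.int32),
    target_lengths=torch.full((2,), 10, dtype=torch.int32),
    blank=0)
```
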
The Blizzard Machine Learning Challenge 2017
2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) Pub Date: 2017-12-01 DOI: 10.1109/ASRU.2017.8268954
Kei Sawada, K. Tokuda, Simon King, A. Black
Abstract: This paper describes the Blizzard Machine Learning Challenge (BMLC) 2017, a spin-off of the Blizzard Challenge. The annual Blizzard Challenges 2005-2017 were held to better understand and compare research techniques for building corpus-based text-to-speech (TTS) systems on the same data, and the series has helped measure progress in TTS technology. However, achieving competitive performance requires a lot of time to be spent on skilled, speech-specific tasks, which may make the Blizzard Challenge unattractive to machine learning researchers from other fields. The BMLC was therefore designed not to involve these speech-specific tasks, allowing participants to concentrate on the acoustic modeling task, framed as a straightforward machine learning problem with a fixed dataset. In the BMLC 2017, two types of datasets consisting of four hours of speech data suitable for machine learning problems were distributed. This paper summarizes the purpose, design, and whole process of the challenge and its results.
Citations: 2
Aalto system for the 2017 Arabic multi-genre broadcast challenge
2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) Pub Date: 2017-12-01 DOI: 10.1109/ASRU.2017.8268955
Peter Smit, Siva Charan Reddy Gangireddy, Seppo Enarvi, Sami Virpioja, M. Kurimo
Abstract: We describe the speech recognition systems we have created for MGB-3, the 3rd Multi Genre Broadcast challenge, which this year consisted of a task of building a system for transcribing Egyptian Dialect Arabic speech, using a big audio corpus of primarily Modern Standard Arabic speech and only a small amount (5 hours) of Egyptian adaptation data. Our system, which was a combination of different acoustic models, language models and lexical units, achieved a Multi-Reference Word Error Rate of 29.25%, which was the lowest in the competition. Also on the old MGB-2 task, which was run again to indicate progress, we achieved the lowest error rate: 13.2%. The result is a combination of the application of state-of-the-art speech recognition methods such as simple dialect adaptation for a Time-Delay Neural Network (TDNN) acoustic model (−27% errors compared to the baseline), Recurrent Neural Network Language Model (RNNLM) rescoring (an additional −5%), and system combination with Minimum Bayes Risk (MBR) decoding (yet another −10%). We also explored the use of morph and character language models, which was particularly beneficial in providing a rich pool of systems for the MBR decoding.
Citations: 18
Language independent end-to-end architecture for joint language identification and speech recognition
2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) Pub Date: 2017-12-01 DOI: 10.1109/ASRU.2017.8268945
Shinji Watanabe, Takaaki Hori, J. Hershey
Abstract: End-to-end automatic speech recognition (ASR) can significantly reduce the burden of developing ASR systems for new languages, by eliminating the need for linguistic information such as pronunciation dictionaries. This also creates an opportunity, which we fully exploit in this paper, to build a monolithic multilingual ASR system with a language-independent neural network architecture. We present a model that can recognize speech in 10 different languages, by directly performing grapheme (character/chunked-character) based speech recognition. The model is based on our hybrid attention/connectionist temporal classification (CTC) architecture which has previously been shown to achieve the state-of-the-art performance in several ASR benchmarks. Here we augment its set of output symbols to include the union of character sets appearing in all the target languages. These include Roman and Cyrillic alphabets, Arabic numbers, simplified Chinese, and Japanese Kanji/Hiragana/Katakana characters (5,500 characters in all). This allows training of a single multilingual model, whose parameters are shared across all the languages. The model can jointly identify the language and recognize the speech, automatically formatting the recognized text in the appropriate character set. The experiments, which used speech databases composed of Wall Street Journal (English), Corpus of Spontaneous Japanese, HKUST Mandarin CTS, and Voxforge (German, Spanish, French, Italian, Dutch, Portuguese, Russian), demonstrate comparable/superior performance relative to language-dependent end-to-end ASR systems.
Citations: 127
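
The language-independent output design amounts to building one softmax inventory from the union of all languages' character sets. A hedged sketch is shown below; the toy corpus dictionary and the "[EN]"-style language-ID-token convention are assumptions that may differ from the paper's actual symbol set.

```python
# Sketch of a language-independent output inventory: take the union of the
# characters seen in all training transcripts and add one language-ID token
# per language, so a single output layer covers every language.
def build_joint_vocab(transcripts_by_language):
    chars = set()
    for lang, transcripts in transcripts_by_language.items():
        for text in transcripts:
            chars.update(text)
    symbols = ["<blank>", "<sos/eos>"]
    symbols += [f"[{lang.upper()}]" for lang in sorted(transcripts_by_language)]
    symbols += sorted(chars)
    return {sym: idx for idx, sym in enumerate(symbols)}

corpora = {
    "en": ["the quick brown fox"],
    "ja": ["音声認識"],
    "ru": ["распознавание речи"],
}
vocab = build_joint_vocab(corpora)
print(len(vocab))  # one shared output layer covers Roman, Kanji and Cyrillic symbols
```
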
Future vector enhanced LSTM language model for LVCSR
2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) Pub Date: 2017-12-01 DOI: 10.1109/ASRU.2017.8268923
Qi Liu, Y. Qian, Kai Yu
Abstract: Language models (LMs) play an important role in large vocabulary continuous speech recognition (LVCSR). However, traditional language models only predict the next single word given a history, while consecutive predictions over a sequence of words are usually demanded, and useful, in LVCSR. This mismatch between single-word prediction during training and long-term sequence prediction at decoding time may degrade performance. In this paper, a novel enhanced long short-term memory (LSTM) LM using a future vector is proposed. In addition to the given history, the rest of the sequence is also embedded as a future vector, which can be incorporated into the LSTM LM so that it can model much longer-term, sequence-level information. Experiments show that the proposed LSTM LM achieves better BLEU scores for long-term sequence prediction. For speech recognition rescoring, although the proposed LSTM LM alone obtains only very slight gains, it proves strongly complementary to the conventional LSTM LM: rescoring with both the new and conventional LSTM LMs achieves a large improvement in word error rate.
Citations: 2
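
One way to read the abstract is that each prediction is conditioned on the usual forward history state plus a vector summarizing the remaining words. The sketch below realizes that future vector with a backward LSTM over the rest of the sentence; this realization and all dimensions are assumptions, since the abstract does not detail the construction.

```python
# Sketch of an LSTM LM augmented with a future vector: the next-word
# distribution is conditioned on the forward history state plus an embedding
# of the remaining words, here produced by a backward LSTM (an assumption).
import torch
import torch.nn as nn

class FutureVectorLM(nn.Module):
    def __init__(self, vocab=10000, emb=256, hid=512, fut=128):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.history_lstm = nn.LSTM(emb, hid, batch_first=True)
        self.future_lstm = nn.LSTM(emb, fut, batch_first=True)
        self.out = nn.Linear(hid + fut, vocab)

    def forward(self, words):
        # words: (B, T). Predict word t+1 from history 0..t and future t+2..T-1.
        e = self.embed(words)
        hist, _ = self.history_lstm(e)                      # (B, T, hid)
        rev, _ = self.future_lstm(torch.flip(e, dims=[1]))  # backward pass
        fut = torch.flip(rev, dims=[1])                     # fut[t] summarizes t..T-1
        # Shift by two so position t never sees word t+1 (the prediction target).
        fut = torch.cat([fut[:, 2:], fut.new_zeros(fut.size(0), 2, fut.size(2))], dim=1)
        return self.out(torch.cat([hist, fut], dim=-1))     # logits for word t+1

model = FutureVectorLM()
logits = model(torch.randint(0, 10000, (4, 20)))
print(logits.shape)  # (4, 20, 10000)
```
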