{"title":"Hierarchical recurrent neural network for story segmentation using fusion of lexical and acoustic features","authors":"E. Tsunoo, Ondrej Klejch, P. Bell, S. Renals","doi":"10.1109/ASRU.2017.8268981","DOIUrl":"https://doi.org/10.1109/ASRU.2017.8268981","url":null,"abstract":"A broadcast news stream consists of a number of stories and it is an important task to find the boundaries of stories automatically in news analysis. We capture the topic structure using a hierarchical model based on a Recurrent Neural Network (RNN) sentence modeling layer and a bidirectional Long Short-Term Memory (LSTM) topic modeling layer, with a fusion of acoustic and lexical features. Both features are accumulated with RNNs and trained jointly within the model to be fused at the sentence level. We conduct experiments on the topic detection and tracking (TDT4) task comparing combinations of two modalities trained with limited amount of parallel data. Further we utilize additional sufficient text data for training to polish our model. Experimental results indicate that the hierarchical RNN topic modeling takes advantage of the fusion scheme, especially with additional text training data, with a higher F1-measure compared to conventional state-of-the-art methods.","PeriodicalId":290868,"journal":{"name":"2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122079285","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Leveraging native language speech for accent identification using deep Siamese networks","authors":"Aditya Siddhant, P. Jyothi, Sriram Ganapathy","doi":"10.1109/ASRU.2017.8268994","DOIUrl":"https://doi.org/10.1109/ASRU.2017.8268994","url":null,"abstract":"The problem of automatic accent identification is important for several applications like speaker profiling and recognition as well as for improving speech recognition systems. The accented nature of speech can be primarily attributed to the influence of the speaker's native language on the given speech recording. In this paper, we propose a novel accent identification system whose training exploits speech in native languages along with the accented speech. Specifically, we develop a deep Siamese network based model which learns the association between accented speech recordings and the native language speech recordings. The Siamese networks are trained with i-vector features extracted from the speech recordings using either an unsupervised Gaussian mixture model (GMM) or a supervised deep neural network (DNN) model. We perform several accent identification experiments using the CSLU Foreign Accented English (FAE) corpus. In these experiments, our proposed approach using deep Siamese networks yield significant relative performance improvements of 15.4% on a 10-class accent identification task, over a baseline DNN-based classification system that uses GMM i-vectors. Furthermore, we present a detailed error analysis of the proposed accent identification system.","PeriodicalId":290868,"journal":{"name":"2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133957715","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Turbo fusion of magnitude and phase information for DNN-based phoneme recognition","authors":"Timo Lohrenz, T. Fingscheidt","doi":"10.1109/ASRU.2017.8268925","DOIUrl":"https://doi.org/10.1109/ASRU.2017.8268925","url":null,"abstract":"In this work we propose the so-called turbo fusion as competitive method for information fusion of Mel-filterbank magnitude and phase feature streams for automatic speech recognition (ASR). Based on the recently introduced turbo ASR paradigm, our contribution is fourfold: First, we introduce DNN-based acoustic modeling into turbo ASR, then we take steps towards LVCSR by omitting the costly state space transform and by investigating the classical TIMIT phoneme recognition task. Finally, replacing the typical stream weighting in fusion methods, we introduce a new dynamic range limitation of the exchanged posteriors between the involved magnitude and phase recognizers, resulting in a smoother information exchange. The proposed turbo fusion outperforms classical benchmarks on the TIMIT dataset both with and without dropout in DNN training, and also is first if compared to several state-of-the-art reference fusion methods.","PeriodicalId":290868,"journal":{"name":"2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130478115","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Semi-supervised training strategies for deep neural networks","authors":"Matthew Gibson, G. Cook, P. Zhan","doi":"10.1109/ASRU.2017.8268919","DOIUrl":"https://doi.org/10.1109/ASRU.2017.8268919","url":null,"abstract":"Use of both manually and automatically labelled data for model training is referred to as semi-supervised training. While semi-supervised acoustic model training has been well-explored in the context of hidden Markov Gaussian mixture models (HMM-GMMs), the re-emergence of deep neural network (DNN) acoustic models has given rise to some novel approaches to semi-supervised DNN training. This paper investigates several different strategies for semi-supervised DNN training, including the so-called ‘shared hidden layer’ approach and the ‘knowledge distillation’ (or student-teacher) approach. Particular attention is paid to the differing behaviour of semi-supervised DNN training methods during the cross-entropy and sequence training phases of model building. Experimental results on our internal study dataset provide evidence that in a low-resource scenario the most effective semi-supervised training strategy is ‘naive CE’ (treating manually transcribed and automatically transcribed data identically during the cross entropy phase of training) followed by use of a shared hidden layer technique during sequence training.","PeriodicalId":290868,"journal":{"name":"2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"82 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123282118","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Spoofing detection via simultaneous verification of audio-visual synchronicity and transcription","authors":"Lea Schonherr, Steffen Zeiler, D. Kolossa","doi":"10.1109/ASRU.2017.8268990","DOIUrl":"https://doi.org/10.1109/ASRU.2017.8268990","url":null,"abstract":"Acoustic speaker recognition systems are very vulnerable to spoofing attacks via replayed or synthesized utterances. One possible countermeasure is audio-visual speaker recognition. Nevertheless, the addition of the visual stream alone does not prevent spoofing attacks completely and only provides further information to assess the authenticity of the utterance. Many systems consider audio and video modalities independently and can easily be spoofed by imitating only a single modality or by a bimodal replay attack with a victim's photograph or video. Therefore, we propose the simultaneous verification of the data synchronicity and the transcription in a challenge-response setup. We use coupled hidden Markov models (CHMMs) for a text-dependent spoofing detection and introduce new features that provide information about the transcriptions of the utterance and the synchronicity of both streams. We evaluate the features for various spoofing scenarios and show that the combination of the features leads to a more robust recognition, also in comparison to the baseline method. Additionally, by evaluating the data on unseen speakers, we show the spoofing detection to be applicable in speaker-independent use-cases.","PeriodicalId":290868,"journal":{"name":"2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114929895","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Exploring architectures, data and units for streaming end-to-end speech recognition with RNN-transducer","authors":"Kanishka Rao, H. Sak, Rohit Prabhavalkar","doi":"10.1109/ASRU.2017.8268935","DOIUrl":"https://doi.org/10.1109/ASRU.2017.8268935","url":null,"abstract":"We investigate training end-to-end speech recognition models with the recurrent neural network transducer (RNN-T): a streaming, all-neural, sequence-to-sequence architecture which jointly learns acoustic and language model components from transcribed acoustic data. We explore various model architectures and demonstrate how the model can be improved further if additional text or pronunciation data are available. The model consists of an ‘encoder’, which is initialized from a connectionist temporal classification-based (CTC) acoustic model, and a ‘decoder’ which is partially initialized from a recurrent neural network language model trained on text data alone. The entire neural network is trained with the RNN-T loss and directly outputs the recognized transcript as a sequence of graphemes, thus performing end-to-end speech recognition. We find that performance can be improved further through the use of sub-word units ('wordpieces') which capture longer context and significantly reduce substitution errors. The best RNN-T system, a twelve-layer LSTM encoder with a two-layer LSTM decoder trained with 30,000 wordpieces as output targets achieves a word error rate of 8.5% on voice-search and 5.2% on voice-dictation tasks and is comparable to a state-of-the-art baseline at 8.3% on voice-search and 5.4% voice-dictation.","PeriodicalId":290868,"journal":{"name":"2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121494930","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The blizzard machine learning challenge 2017","authors":"Kei Sawada, K. Tokuda, Simon King, A. Black","doi":"10.1109/ASRU.2017.8268954","DOIUrl":"https://doi.org/10.1109/ASRU.2017.8268954","url":null,"abstract":"This paper describes the Blizzard Machine Learning Challenge (BMLC) 2017, which is a spin-off of the Blizzard Challenge. The annual Blizzard Challenges 2005–2017 were held to better understand and compare research techniques in building corpus-based text-to-speech (TTS) systems on the same data. The series of Blizzard Challenges has helped us measure progress in TTS technology. However, to get competitive performance, a lot time has to be spent on skilled tasks. This may make the Blizzard Challenge unattractive to machine learning researchers from other fields. Therefore, we recommend that the BMLC not involve these speech-specific tasks and that it allow participants to concentrate on the acoustic modeling task, framed as a straightforward machine learning problem, with a fixed dataset. In the BMLC 2017, two types of datasets consisting of four hours of speech data suitable for machine learning problems were distributed. This paper summarizes the purpose, design, and whole process of the challenge and its results.","PeriodicalId":290868,"journal":{"name":"2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114322824","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Aalto system for the 2017 Arabic multi-genre broadcast challenge","authors":"Peter Smit, Siva Charan Reddy Gangireddy, Seppo Enarvi, Sami Virpioja, M. Kurimo","doi":"10.1109/ASRU.2017.8268955","DOIUrl":"https://doi.org/10.1109/ASRU.2017.8268955","url":null,"abstract":"We describe the speech recognition systems we have created for MGB-3, the 3rd Multi Genre Broadcast challenge, which this year consisted of a task of building a system for transcribing Egyptian Dialect Arabic speech, using a big audio corpus of primarily Modern Standard Arabic speech and only a small amount (5 hours) of Egyptian adaptation data. Our system, which was a combination of different acoustic models, language models and lexical units, achieved a Multi-Reference Word Error Rate of 29.25%, which was the lowest in the competition. Also on the old MGB-2 task, which was run again to indicate progress, we achieved the lowest error rate: 13.2%. The result is a combination of the application of state-of-the-art speech recognition methods such as simple dialect adaptation for a Time-Delay Neural Network (TDNN) acoustic model (−27% errors compared to the baseline), Recurrent Neural Network Language Model (RNNLM) rescoring (an additional −5%), and system combination with Minimum Bayes Risk (MBR) decoding (yet another −10%). We also explored the use of morph and character language models, which was particularly beneficial in providing a rich pool of systems for the MBR decoding.","PeriodicalId":290868,"journal":{"name":"2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"210 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133027468","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Language independent end-to-end architecture for joint language identification and speech recognition","authors":"Shinji Watanabe, Takaaki Hori, J. Hershey","doi":"10.1109/ASRU.2017.8268945","DOIUrl":"https://doi.org/10.1109/ASRU.2017.8268945","url":null,"abstract":"End-to-end automatic speech recognition (ASR) can significantly reduce the burden of developing ASR systems for new languages, by eliminating the need for linguistic information such as pronunciation dictionaries. This also creates an opportunity, which we fully exploit in this paper, to build a monolithic multilingual ASR system with a language-independent neural network architecture. We present a model that can recognize speech in 10 different languages, by directly performing grapheme (character/chunked-character) based speech recognition. The model is based on our hybrid attention/connectionist temporal classification (CTC) architecture which has previously been shown to achieve the state-of-the-art performance in several ASR benchmarks. Here we augment its set of output symbols to include the union of character sets appearing in all the target languages. These include Roman and Cyrillic Alphabets, Arabic numbers, simplified Chinese, and Japanese Kanji/Hiragana/Katakana characters (5,500 characters in all). This allows training of a single multilingual model, whose parameters are shared across all the languages. The model can jointly identify the language and recognize the speech, automatically formatting the recognized text in the appropriate character set. The experiments, which used speech databases composed of Wall Street Journal (English), Corpus of Spontaneous Japanese, HKUST Mandarin CTS, and Voxforge (German, Spanish, French, Italian, Dutch, Portuguese, Russian), demonstrate comparable/superior performance relative to language-dependent end-to-end ASR systems.","PeriodicalId":290868,"journal":{"name":"2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"114 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134194249","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Future vector enhanced LSTM language model for LVCSR","authors":"Qi Liu, Y. Qian, Kai Yu","doi":"10.1109/ASRU.2017.8268923","DOIUrl":"https://doi.org/10.1109/ASRU.2017.8268923","url":null,"abstract":"Language models (LM) play an important role in large vocabulary continuous speech recognition (LVCSR). However, traditional language models only predict next single word with given history, while the consecutive predictions on a sequence of words are usually demanded and useful in LVCSR. The mismatch between the single word prediction modeling in trained and the long term sequence prediction in read demands may lead to the performance degradation. In this paper, a novel enhanced long short-term memory (LSTM) LM using the future vector is proposed. In addition to the given history, the rest of the sequence will be also embedded by future vectors. This future vector can be incorporated with the LSTM LM, so it has the ability to model much longer term sequence level information. Experiments show that, the proposed new LSTM LM gets a better result on BLEU scores for long term sequence prediction. For the speech recognition rescoring, although the proposed LSTM LM obtains very slight gains, the new model seems obtain the great complementary with the conventional LSTM LM. Rescoring using both the new and conventional LSTM LMs can achieve a very large improvement on the word error rate.","PeriodicalId":290868,"journal":{"name":"2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"55 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132256646","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}