Title: Language diarization for semi-supervised bilingual acoustic model training
Authors: Emre Yilmaz, Mitchell McLaren, H. V. D. Heuvel, D. V. Leeuwen
Venue: 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), December 2017
DOI: https://doi.org/10.1109/ASRU.2017.8268921
Abstract: In this paper, we investigate several automatic transcription schemes for using raw bilingual broadcast news data in semi-supervised bilingual acoustic model training. Specifically, we compare the transcription quality provided by a bilingual ASR system with another system performing language diarization at the front-end followed by two monolingual ASR systems chosen based on the assigned language label. Our research focuses on the Frisian-Dutch code-switching (CS) speech that is extracted from the archives of a local radio broadcaster. Using 11 hours of manually transcribed Frisian speech as a reference, we aim to increase the amount of available training data by using these automatic transcription techniques. By merging the manually and automatically transcribed data, we learn bilingual acoustic models and run ASR experiments on the development and test data of the FAME! speech corpus to quantify the quality of the automatic transcriptions. Using these acoustic models, we present speech recognition and CS detection accuracies. The results demonstrate that applying language diarization to the raw speech data to enable using the monolingual resources improves the automatic transcription quality compared to a baseline system using a bilingual ASR system.

Title: Adversarial training for data-driven speech enhancement without parallel corpus
Authors: T. Higuchi, K. Kinoshita, Marc Delcroix, T. Nakatani
Venue: 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), December 2017
DOI: https://doi.org/10.1109/ASRU.2017.8268914
Abstract: This paper describes a way of performing data-driven speech enhancement for noise robust automatic speech recognition (ASR), where we train a model for speech enhancement without a parallel corpus. Data-driven speech enhancement with deep models has recently been investigated and proven to be a promising approach for ASR. However, for model training, we need a parallel corpus consisting of noisy speech signals and corresponding clean speech signals for supervision. Therefore a deep model can be trained only with a simulated dataset, and we cannot take advantage of a large number of noisy recordings that do not have corresponding clean speech signals. As a first step towards model training without supervision, this paper proposes a novel approach introducing adversarial training for a time-frequency mask estimator. Our cost function for model training is defined by discriminators instead of by using the distance between the model outputs and the supervision. The discriminators distinguish between true signals and enhanced signals obtained with time-frequency masks estimated with a mask estimator. The mask estimator is trained to cheat the discriminators, which enables the mask estimator to estimate the appropriate time-frequency masks without a parallel corpus. The enhanced signal is finally obtained with masking-based beamforming. Experimental results show that, even without exploiting parallel data, our speech enhancement approach achieves improved ASR performance compared with results obtained with unprocessed signals and achieves comparable ASR performance to that obtained with a model trained with a parallel corpus based on a minimum mean squared error (MMSE) criterion.
{"title":"Simplifying very deep convolutional neural network architectures for robust speech recognition","authors":"Joanna Rownicka, S. Renals, P. Bell","doi":"10.1109/ASRU.2017.8268941","DOIUrl":"https://doi.org/10.1109/ASRU.2017.8268941","url":null,"abstract":"Very deep convolutional neural networks (VDCNNs) have been successfully used in computer vision. More recently VDCNNs have been applied to speech recognition, using architectures adopted from computer vision. In this paper, we experimentally analyse the role of the components in VDCNN architectures for robust speech recognition. We have proposed a number of simplified VDCNN architectures, taking into account the use of fully-connected layers and down-sampling approaches. We have investigated three ways to down-sample feature maps: max-pooling, average-pooling, and convolution with increased stride. Our proposed model consisting solely of convolutional (conv) layers, and without any fully-connected layers, achieves a lower word error rate on Aurora 4 compared to other VDCNN architectures typically used in speech recognition. We have also extended our experiments to the MGB-3 task of multi-genre broadcast recognition using BBC TV recordings. The MGB-3 results indicate that the same architecture achieves the best result among our VDCNNs on this task as well.","PeriodicalId":290868,"journal":{"name":"2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"66 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134173185","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Title: Early and late integration of audio features for automatic video description
Authors: Chiori Hori, Takaaki Hori, Tim K. Marks, J. Hershey
Venue: 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), December 2017
DOI: https://doi.org/10.1109/ASRU.2017.8268968
Abstract: This paper presents our approach to improve video captioning by integrating audio and video features. Video captioning is the task of generating a textual description to describe the content of a video. State-of-the-art approaches to video captioning are based on sequence-to-sequence models, in which a single neural network accepts sequential images and audio data, and outputs a sequence of words that best describe the input data in natural language. The network thus learns to encode the video input into an intermediate semantic representation, which can be useful in applications such as multimedia indexing, automatic narration, and audio-visual question answering. In our prior work, we proposed an attention-based multi-modal fusion mechanism to integrate image, motion, and audio features, where the multiple features are integrated in the network. Here, we apply hypothesis-level integration based on minimum Bayes-risk (MBR) decoding to further improve the caption quality, focusing on well-known evaluation metrics (BLEU and METEOR scores). Experiments with the YouTube2Text and MSR-VTT datasets demonstrate that combinations of early and late integration of multimodal features significantly improve the audio-visual semantic representation, as measured by the resulting caption quality. In addition, we compared the performance of our method using two different types of audio features: MFCC features, and the audio features extracted using SoundNet, which was trained to recognize objects and scenes from videos using only the audio signals.
{"title":"Cross-domain speech recognition using nonparallel corpora with cycle-consistent adversarial networks","authors":"M. Mimura, S. Sakai, Tatsuya Kawahara","doi":"10.1109/ASRU.2017.8268927","DOIUrl":"https://doi.org/10.1109/ASRU.2017.8268927","url":null,"abstract":"Automatic speech recognition (ASR) systems often does not perform well when it is used in a different acoustic domain from the training time, such as utterances spoken in noisy environments or in different speaking styles. We propose a novel approach to cross-domain speech recognition based on acoustic feature mappings provided by a deep neural network, which is trained using nonparallel speech corpora from two different domains and using no phone labels. For training a target domain acoustic model, we generate “fake” target speech features from the labeleld source domain features using a mapping Gf. We can also generate “fake” source features for testing from the target features using the backward mapping Gb which has been learned simultaneously with G f. The mappings G f and Gb are trained as adversarial networks using a conventional adversarial loss and a cycle-consistency loss criterion that encourages the backward mapping to bring the translated feature back to the original as much as possible such that Gb(Gf (x)) ≈ x. In a highly challenging task of model adaptation only using domain speech features, our method achieved up to 16 % relative improvements in WER in the evaluation using the CHiME3 real test data. The backward mapping was also confirmed to be effective with a speaking style adaptation task.","PeriodicalId":290868,"journal":{"name":"2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"98 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125773125","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Title: Keyword spotting for Google assistant using contextual speech recognition
Authors: A. Michaely, Xuedong Zhang, Gabor Simko, Carolina Parada, Petar S. Aleksic
Venue: 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), December 2017
DOI: https://doi.org/10.1109/ASRU.2017.8268946
Abstract: We present a novel keyword spotting (KWS) system that uses contextual automatic speech recognition (ASR). For voice-activated devices, it is common that a KWS system is run on the device in order to quickly detect a trigger phrase (e.g. “Ok Google”). After the trigger phrase is detected, the audio corresponding to the voice command that follows is streamed to the server. The audio is transcribed by the server-side ASR system and semantically processed to generate a response which is sent back to the device. Due to limited resources on the device, the device KWS system might introduce false accepts (FA) and false rejects (FR) that can cause an unsatisfactory user experience. We describe a system that uses server-side contextual ASR and trigger phrase non-terminals to improve overall KWS accuracy. We show that this approach can significantly reduce the FA rate (by 89%) while minimally increasing the FR rate (by 0.2%). Furthermore, we show that this system significantly improves the ASR quality, reducing Word Error Rate (WER) (by 10% to 50% relative), and allows the user to speak seamlessly, without pausing between the trigger phrase and the voice command.

Title: Extracting bottleneck features and word-like pairs from untranscribed speech for feature representation
Authors: Yougen Yuan, C. Leung, Lei Xie, Hongjie Chen, B. Ma, Haizhou Li
Venue: 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), December 2017
DOI: https://doi.org/10.1109/ASRU.2017.8269010
Abstract: We propose a framework to learn a frame-level speech representation in a scenario where no manual transcription is available. Our framework is based on pairwise learning using bottleneck features (BNFs). Initial frame-level features are extracted from a bottleneck-shaped multilingual deep neural network (DNN) which is trained with unsupervised phoneme-like labels. Word-like pairs are discovered in the untranscribed speech using the initial features, and frame alignment is performed on each word-like speech pair. The matching frame pairs are used as input-output to train another DNN with the mean square error (MSE) loss function. The final frame-level features are extracted from an internal hidden layer of MSE-based DNN. Our pairwise learned feature representation is evaluated on the ZeroSpeech 2017 challenge. The experiments show that pairwise learning improves phoneme discrimination in 10s and 120s test conditions. We find that it is important to use BNFs as initial features when pairwise learning is performed. With more word pairs obtained from the Switchboard corpus and its manual transcription, the phoneme discrimination of three languages in the evaluation data can further be improved despite data mismatch.

Title: Exploring the use of acoustic embeddings in neural machine translation
Authors: S. Deena, Raymond W. M. Ng, P. Madhyastha, Lucia Specia, Thomas Hain
Venue: 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), December 2017
DOI: https://doi.org/10.1109/ASRU.2017.8268971
Abstract: Neural Machine Translation (NMT) has recently demonstrated improved performance over statistical machine translation and relies on an encoder-decoder framework for translating text from source to target. The structure of NMT makes it amenable to add auxiliary features, which can provide complementary information to that present in the source text. In this paper, auxiliary features derived from accompanying audio are investigated for NMT and are compared and combined with text-derived features. These acoustic embeddings can help resolve ambiguity in the translation, thus improving the output. The following features are experimented with: Latent Dirichlet Allocation (LDA) topic vectors and GMM subspace i-vectors derived from audio. These are contrasted against: skip-gram/Word2Vec features and LDA features derived from text. The results are encouraging and show that acoustic information does help with NMT, leading to an overall 3.3% relative improvement in BLEU scores.

Title: Composite embedding systems for ZeroSpeech2017 Track1
Authors: Hayato Shibata, Taku Kato, T. Shinozaki, Shinji Watanabe
Venue: 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), December 2017
DOI: https://doi.org/10.1109/ASRU.2017.8269012
Abstract: This paper investigates novel composite embedding systems for language-independent high-performance feature extraction using triphone-based DNN-HMM and character-based end-to-end speech recognition systems. The DNN-HMM is trained with phoneme transcripts based on a large-scale Japanese ASR recipe included in the Kaldi toolkit from the Corpus of Spontaneous Japanese (CSJ) with some modifications. The end-to-end ASR system is based on a hybrid architecture consisting of an attention-based encoder-decoder and connectionist temporal classification. This model is trained with multi-language speech data using character transcripts in a pure end-to-end fashion without requiring phonemic representation. Posterior features, PCA-transformed features, and bottleneck features are extracted from the two systems; then, various combinations of features are explored. Additionally, a bypassed autoencoder (bypassed AE) is proposed to normalize speaker characteristics in an unsupervised manner. An evaluation using the ABX test showed that the DNN-HMM-based CSJ bottleneck features resulted in a good performance regardless of the input language. The pre-activation vectors extracted from the multilingual end-to-end system with PCA provided a somewhat better performance than did the CSJ bottleneck features. The bypassed AE yielded an improved performance over a baseline AE. The lowest error rates were obtained by composite features that concatenated the end-to-end features with the CSJ bottleneck features.
{"title":"Ground truth estimation of spoken english fluency score using decorrelation penalized low-rank matrix factorization","authors":"Hoon Chung, Y. Lee, J. Park","doi":"10.1109/ASRU.2017.8268970","DOIUrl":"https://doi.org/10.1109/ASRU.2017.8268970","url":null,"abstract":"In this paper, we propose ground truth estimation of spoken English fluency scores using decorrelation penalized low-rank matrix factorization. Automatic spoken English fluency scoring is a general classification problem. The model parameters are trained to map input fluency features to corresponding ground truth scores, and then used to predict a score for an input utterance. Therefore, in order to estimate the model parameters to predict scores reliably, correct ground truth scores must be provided as target outputs. However, it is not simple to determine correct ground truth scores from human raters' scores, as these include subjective biases. Therefore, ground truth scores are usually estimated from human raters' scores, and two of the most common methods are averaging and voting. Although these methods are used successfully, questions remain about whether the methods effectively estimate ground truth scores by considering human raters' subjective biases and performance metric. Therefore, to address these issues, we propose an approach based on low-rank matrix factorization penalized by decorrelation. The proposed method decomposes human raters' scores to biases and latent scores maximizing Pearson's correlation. The effectiveness of the proposed approach was evaluated using human ratings of the Korean-Spoken English Corpus.","PeriodicalId":290868,"journal":{"name":"2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125600468","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}