Title: Language diarization for semi-supervised bilingual acoustic model training
Authors: Emre Yilmaz, Mitchell McLaren, H. V. D. Heuvel, D. V. Leeuwen
Venue: 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), December 2017
DOI: https://doi.org/10.1109/ASRU.2017.8268921
Abstract: In this paper, we investigate several automatic transcription schemes for using raw bilingual broadcast news data in semi-supervised bilingual acoustic model training. Specifically, we compare the transcription quality provided by a bilingual ASR system with another system performing language diarization at the front-end followed by two monolingual ASR systems chosen based on the assigned language label. Our research focuses on the Frisian-Dutch code-switching (CS) speech that is extracted from the archives of a local radio broadcaster. Using 11 hours of manually transcribed Frisian speech as a reference, we aim to increase the amount of available training data by using these automatic transcription techniques. By merging the manually and automatically transcribed data, we learn bilingual acoustic models and run ASR experiments on the development and test data of the FAME! speech corpus to quantify the quality of the automatic transcriptions. Using these acoustic models, we present speech recognition and CS detection accuracies. The results demonstrate that applying language diarization to the raw speech data to enable using the monolingual resources improves the automatic transcription quality compared to a baseline system using a bilingual ASR system.

Title: Adversarial training for data-driven speech enhancement without parallel corpus
Authors: T. Higuchi, K. Kinoshita, Marc Delcroix, T. Nakatani
Venue: 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), December 2017
DOI: https://doi.org/10.1109/ASRU.2017.8268914
Abstract: This paper describes a way of performing data-driven speech enhancement for noise robust automatic speech recognition (ASR), where we train a model for speech enhancement without a parallel corpus. Data-driven speech enhancement with deep models has recently been investigated and proven to be a promising approach for ASR. However, for model training, we need a parallel corpus consisting of noisy speech signals and corresponding clean speech signals for supervision. Therefore a deep model can be trained only with a simulated dataset, and we cannot take advantage of a large number of noisy recordings that do not have corresponding clean speech signals. As a first step towards model training without supervision, this paper proposes a novel approach introducing adversarial training for a time-frequency mask estimator. Our cost function for model training is defined by discriminators instead of by using the distance between the model outputs and the supervision. The discriminators distinguish between true signals and enhanced signals obtained with time-frequency masks estimated with a mask estimator. The mask estimator is trained to cheat the discriminators, which enables the mask estimator to estimate the appropriate time-frequency masks without a parallel corpus. The enhanced signal is finally obtained with masking-based beamforming. Experimental results show that, even without exploiting parallel data, our speech enhancement approach achieves improved ASR performance compared with results obtained with unprocessed signals and achieves comparable ASR performance to that obtained with a model trained with a parallel corpus based on a minimum mean squared error (MMSE) criterion.
{"title":"Simplifying very deep convolutional neural network architectures for robust speech recognition","authors":"Joanna Rownicka, S. Renals, P. Bell","doi":"10.1109/ASRU.2017.8268941","DOIUrl":"https://doi.org/10.1109/ASRU.2017.8268941","url":null,"abstract":"Very deep convolutional neural networks (VDCNNs) have been successfully used in computer vision. More recently VDCNNs have been applied to speech recognition, using architectures adopted from computer vision. In this paper, we experimentally analyse the role of the components in VDCNN architectures for robust speech recognition. We have proposed a number of simplified VDCNN architectures, taking into account the use of fully-connected layers and down-sampling approaches. We have investigated three ways to down-sample feature maps: max-pooling, average-pooling, and convolution with increased stride. Our proposed model consisting solely of convolutional (conv) layers, and without any fully-connected layers, achieves a lower word error rate on Aurora 4 compared to other VDCNN architectures typically used in speech recognition. We have also extended our experiments to the MGB-3 task of multi-genre broadcast recognition using BBC TV recordings. The MGB-3 results indicate that the same architecture achieves the best result among our VDCNNs on this task as well.","PeriodicalId":290868,"journal":{"name":"2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"66 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134173185","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Title: Early and late integration of audio features for automatic video description
Authors: Chiori Hori, Takaaki Hori, Tim K. Marks, J. Hershey
Venue: 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), December 2017
DOI: https://doi.org/10.1109/ASRU.2017.8268968
Abstract: This paper presents our approach to improve video captioning by integrating audio and video features. Video captioning is the task of generating a textual description to describe the content of a video. State-of-the-art approaches to video captioning are based on sequence-to-sequence models, in which a single neural network accepts sequential images and audio data, and outputs a sequence of words that best describe the input data in natural language. The network thus learns to encode the video input into an intermediate semantic representation, which can be useful in applications such as multimedia indexing, automatic narration, and audio-visual question answering. In our prior work, we proposed an attention-based multi-modal fusion mechanism to integrate image, motion, and audio features, where the multiple features are integrated in the network. Here, we apply hypothesis-level integration based on minimum Bayes-risk (MBR) decoding to further improve the caption quality, focusing on well-known evaluation metrics (BLEU and METEOR scores). Experiments with the YouTube2Text and MSR-VTT datasets demonstrate that combinations of early and late integration of multimodal features significantly improve the audio-visual semantic representation, as measured by the resulting caption quality. In addition, we compared the performance of our method using two different types of audio features: MFCC features, and the audio features extracted using SoundNet, which was trained to recognize objects and scenes from videos using only the audio signals.
{"title":"Cross-domain speech recognition using nonparallel corpora with cycle-consistent adversarial networks","authors":"M. Mimura, S. Sakai, Tatsuya Kawahara","doi":"10.1109/ASRU.2017.8268927","DOIUrl":"https://doi.org/10.1109/ASRU.2017.8268927","url":null,"abstract":"Automatic speech recognition (ASR) systems often does not perform well when it is used in a different acoustic domain from the training time, such as utterances spoken in noisy environments or in different speaking styles. We propose a novel approach to cross-domain speech recognition based on acoustic feature mappings provided by a deep neural network, which is trained using nonparallel speech corpora from two different domains and using no phone labels. For training a target domain acoustic model, we generate “fake” target speech features from the labeleld source domain features using a mapping Gf. We can also generate “fake” source features for testing from the target features using the backward mapping Gb which has been learned simultaneously with G f. The mappings G f and Gb are trained as adversarial networks using a conventional adversarial loss and a cycle-consistency loss criterion that encourages the backward mapping to bring the translated feature back to the original as much as possible such that Gb(Gf (x)) ≈ x. In a highly challenging task of model adaptation only using domain speech features, our method achieved up to 16 % relative improvements in WER in the evaluation using the CHiME3 real test data. The backward mapping was also confirmed to be effective with a speaking style adaptation task.","PeriodicalId":290868,"journal":{"name":"2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"98 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125773125","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Title: Keyword spotting for Google assistant using contextual speech recognition
Authors: A. Michaely, Xuedong Zhang, Gabor Simko, Carolina Parada, Petar S. Aleksic
Venue: 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), December 2017
DOI: https://doi.org/10.1109/ASRU.2017.8268946
Abstract: We present a novel keyword spotting (KWS) system that uses contextual automatic speech recognition (ASR). For voice-activated devices, it is common that a KWS system is run on the device in order to quickly detect a trigger phrase (e.g. “Ok Google”). After the trigger phrase is detected, the audio corresponding to the voice command that follows is streamed to the server. The audio is transcribed by the server-side ASR system and semantically processed to generate a response which is sent back to the device. Due to limited resources on the device, the device KWS system might introduce false accepts (FA) and false rejects (FR) that can cause an unsatisfactory user experience. We describe a system that uses server-side contextual ASR and trigger phrase non-terminals to improve overall KWS accuracy. We show that this approach can significantly reduce the FA rate (by 89%) while minimally increasing the FR rate (by 0.2%). Furthermore, we show that this system significantly improves the ASR quality, reducing Word Error Rate (WER) (by 10% to 50% relative), and allows the user to speak seamlessly, without pausing between the trigger phrase and the voice command.

Title: Extracting bottleneck features and word-like pairs from untranscribed speech for feature representation
Authors: Yougen Yuan, C. Leung, Lei Xie, Hongjie Chen, B. Ma, Haizhou Li
Venue: 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), December 2017
DOI: https://doi.org/10.1109/ASRU.2017.8269010
Abstract: We propose a framework to learn a frame-level speech representation in a scenario where no manual transcription is available. Our framework is based on pairwise learning using bottleneck features (BNFs). Initial frame-level features are extracted from a bottleneck-shaped multilingual deep neural network (DNN) which is trained with unsupervised phoneme-like labels. Word-like pairs are discovered in the untranscribed speech using the initial features, and frame alignment is performed on each word-like speech pair. The matching frame pairs are used as input-output to train another DNN with the mean square error (MSE) loss function. The final frame-level features are extracted from an internal hidden layer of MSE-based DNN. Our pairwise learned feature representation is evaluated on the ZeroSpeech 2017 challenge. The experiments show that pairwise learning improves phoneme discrimination in 10s and 120s test conditions. We find that it is important to use BNFs as initial features when pairwise learning is performed. With more word pairs obtained from the Switchboard corpus and its manual transcription, the phoneme discrimination of three languages in the evaluation data can further be improved despite data mismatch.

Title: Exploring the use of acoustic embeddings in neural machine translation
Authors: S. Deena, Raymond W. M. Ng, P. Madhyastha, Lucia Specia, Thomas Hain
Venue: 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), December 2017
DOI: https://doi.org/10.1109/ASRU.2017.8268971
Abstract: Neural Machine Translation (NMT) has recently demonstrated improved performance over statistical machine translation and relies on an encoder-decoder framework for translating text from source to target. The structure of NMT makes it amenable to add auxiliary features, which can provide complementary information to that present in the source text. In this paper, auxiliary features derived from accompanying audio are investigated for NMT and are compared and combined with text-derived features. These acoustic embeddings can help resolve ambiguity in the translation, thus improving the output. The following features are experimented with: Latent Dirichlet Allocation (LDA) topic vectors and GMM subspace i-vectors derived from audio. These are contrasted against: skip-gram/Word2Vec features and LDA features derived from text. The results are encouraging and show that acoustic information does help with NMT, leading to an overall 3.3% relative improvement in BLEU scores.

Title: Composite embedding systems for ZeroSpeech2017 Track1
Authors: Hayato Shibata, Taku Kato, T. Shinozaki, Shinji Watanabe
Venue: 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), December 2017
DOI: https://doi.org/10.1109/ASRU.2017.8269012
Abstract: This paper investigates novel composite embedding systems for language-independent high-performance feature extraction using triphone-based DNN-HMM and character-based end-to-end speech recognition systems. The DNN-HMM is trained with phoneme transcripts based on a large-scale Japanese ASR recipe included in the Kaldi toolkit from the Corpus of Spontaneous Japanese (CSJ) with some modifications. The end-to-end ASR system is based on a hybrid architecture consisting of an attention-based encoder-decoder and connectionist temporal classification. This model is trained with multi-language speech data using character transcripts in a pure end-to-end fashion without requiring phonemic representation. Posterior features, PCA-transformed features, and bottleneck features are extracted from the two systems; then, various combinations of features are explored. Additionally, a bypassed autoencoder (bypassed AE) is proposed to normalize speaker characteristics in an unsupervised manner. An evaluation using the ABX test showed that the DNN-HMM-based CSJ bottleneck features resulted in a good performance regardless of the input language. The pre-activation vectors extracted from the multilingual end-to-end system with PCA provided a somewhat better performance than did the CSJ bottleneck features. The bypassed AE yielded an improved performance over a baseline AE. The lowest error rates were obtained by composite features that concatenated the end-to-end features with the CSJ bottleneck features.
{"title":"Ground truth estimation of spoken english fluency score using decorrelation penalized low-rank matrix factorization","authors":"Hoon Chung, Y. Lee, J. Park","doi":"10.1109/ASRU.2017.8268970","DOIUrl":"https://doi.org/10.1109/ASRU.2017.8268970","url":null,"abstract":"In this paper, we propose ground truth estimation of spoken English fluency scores using decorrelation penalized low-rank matrix factorization. Automatic spoken English fluency scoring is a general classification problem. The model parameters are trained to map input fluency features to corresponding ground truth scores, and then used to predict a score for an input utterance. Therefore, in order to estimate the model parameters to predict scores reliably, correct ground truth scores must be provided as target outputs. However, it is not simple to determine correct ground truth scores from human raters' scores, as these include subjective biases. Therefore, ground truth scores are usually estimated from human raters' scores, and two of the most common methods are averaging and voting. Although these methods are used successfully, questions remain about whether the methods effectively estimate ground truth scores by considering human raters' subjective biases and performance metric. Therefore, to address these issues, we propose an approach based on low-rank matrix factorization penalized by decorrelation. The proposed method decomposes human raters' scores to biases and latent scores maximizing Pearson's correlation. The effectiveness of the proposed approach was evaluated using human ratings of the Korean-Spoken English Corpus.","PeriodicalId":290868,"journal":{"name":"2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125600468","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}