{"title":"Improved decision trees for multi-stream HMM-based audio-visual continuous speech recognition","authors":"Jing Huang, Karthik Venkat Ramanan","doi":"10.1109/ASRU.2009.5373454","DOIUrl":"https://doi.org/10.1109/ASRU.2009.5373454","url":null,"abstract":"HMM-based audio-visual speech recognition (AVSR) systems have shown success in continuous speech recognition by combining visual and audio information, especially in noisy environments. In this paper we study how to improve decision trees used to create context classes in HMM-based AVSR systems. Traditionally, visual models have been trained with the same context classes as the audio only models. In this paper we investigate the use of separate decision trees to model the context classes for the audio and visual streams independently. Additionally we investigate the use of viseme classes in the decision tree building for the visual stream. On experiments with a 37-speaker 1.5 hours test set (about 12000 words) of continuous digits in noise, we obtain about a 3% absolute (20% relative) gain on AVSR performance by using separate decision trees for the audio and visual streams when using viseme classes in decision tree building for the visual stream.","PeriodicalId":292194,"journal":{"name":"2009 IEEE Workshop on Automatic Speech Recognition & Understanding","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125998056","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Constrained discriminative training of N-gram language models","authors":"A. Rastrow, A. Sethy, B. Ramabhadran","doi":"10.1109/ASRU.2009.5373338","DOIUrl":"https://doi.org/10.1109/ASRU.2009.5373338","url":null,"abstract":"In this paper, we present a novel version of discriminative training for N-gram language models. Language models impose language specific constraints on the acoustic hypothesis and are crucial in discriminating between competing acoustic hypotheses. As reported in the literature, discriminative training of acoustic models has yielded significant improvements in the performance of a speech recognition system, however, discriminative training for N-gram language models (LMs) has not yielded the same impact. In this paper, we present three techniques to improve the discriminative training of LMs, namely updating the back-off probability of unseen events, normalization of the N-gram updates to ensure a probability distribution and a relative-entropy based global constraint on the N-gram probability updates. We also present a framework for discriminative adaptation of LMs to a new domain and compare it to existing linear interpolation methods. Results are reported on the Broadcast News and the MIT lecture corpora. A modest improvement of 0.2% absolute (on Broadcast News) and 0.3% absolute (on MIT lectures) was observed with discriminatively trained LMs over state-of-the-art systems.","PeriodicalId":292194,"journal":{"name":"2009 IEEE Workshop on Automatic Speech Recognition & Understanding","volume":"416 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122461206","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Automatic punctuation generation for speech","authors":"Wenzhu Shen, Roger Peng Yu, F. Seide, Ji Wu","doi":"10.1109/ASRU.2009.5373365","DOIUrl":"https://doi.org/10.1109/ASRU.2009.5373365","url":null,"abstract":"Automatic generation of punctuation is an essential feature for many speech-to-text transcription tasks. This paper describes a Maximum A-Posteriori (MAP) approach for inserting punctuation marks into raw word sequences obtained from Automatic Speech Recognition (ASR). The system consists of an “acoustic model” (AM) for prosodic features (actually pause duration) and a “language model” (LM) for text-only features. The LM combines three components: an MLP-based trigger-word model and a forward and a backward trigram punctuation predictor. The separation into acoustic and language model allows to learn these models on different corpora, especially allowing the LM to be trained on large amounts of data (text) for which no acoustic information is available. We find that the trigger-word LM is very useful, and further improvement can be achieved when combining both prosodic and lexical information. We achieve an F-measure of 81.0% and 56.5% for voicemails and podcasts, respectively, on reference transcripts, and 69.6% for voicemails on ASR transcripts.","PeriodicalId":292194,"journal":{"name":"2009 IEEE Workshop on Automatic Speech Recognition & Understanding","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114277104","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Robust Speaker Diarization for short speech recordings","authors":"David Imseng, G. Friedland","doi":"10.1109/ASRU.2009.5373254","DOIUrl":"https://doi.org/10.1109/ASRU.2009.5373254","url":null,"abstract":"We investigate a state-of-the-art Speaker Diarization system regarding its behavior on meetings that are much shorter (from 500 seconds down to 100 seconds) than those typically analyzed in Speaker Diarization benchmarks. First, the problems inherent to this task are analyzed. Then, we propose an approach that consists of a novel initialization parameter estimation method for typical state-of-the-art diarization approaches. The estimation method balances the relationship between the optimal value of the duration of speech data per Gaussian and the duration of the speech data, which is verified experimentally for the first time in this article. As a result, the Diarization Error Rate for short meetings extracted from the 2006, 2007, and 2009 NIST RT evaluation data is decreased by up to 50% relative.","PeriodicalId":292194,"journal":{"name":"2009 IEEE Workshop on Automatic Speech Recognition & Understanding","volume":"69 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133191742","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Reinforcing language model for speech translation with auxiliary data","authors":"Jia Cui, Yonggang Deng, Bowen Zhou","doi":"10.1109/ASRU.2009.5373308","DOIUrl":"https://doi.org/10.1109/ASRU.2009.5373308","url":null,"abstract":"Language model domain adaption usually uses a large quantity of auxiliary data in different genres and domains. It has mostly been relying on scoring functions for selection and it is typically independent of intended applications such as machine translation. In this paper, we present a novel domain adaptation approach that is directly motivated by the need of translation engine. We first identify interesting phrases by examining phrase translation tables, and then use those phrases as anchors to select useful and relevant sentences from general domain data, with the goal of improving domain coverage or providing additional contextual information. The experimental results on Farsi to English translation in military force protection domain and Chinese to English translation in travel domain show statistical significant gain using the reinforced language models over the baseline.","PeriodicalId":292194,"journal":{"name":"2009 IEEE Workshop on Automatic Speech Recognition & Understanding","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125743987","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Spoken term detection from bilingual spontaneous speech using code-switched lattice-based structures for words and subword units","authors":"Hung-yi Lee, Yueh-Lien Tang, Hao Tang, Lin-Shan Lee","doi":"10.1109/ASRU.2009.5372901","DOIUrl":"https://doi.org/10.1109/ASRU.2009.5372901","url":null,"abstract":"This paper presents the first work known publicly on spoken term detection from bilingual spontaneous speech using code-switched lattice-based structures for word and subword units. The corpus used is the lectures with Chinese as the host language and English as the guest language recorded for a real course offered in National Taiwan University. The techniques reported here have been successfully implemented and tested in a real lecture system now available on-line over the Internet. We also present the approaches of using word fragment as the subword unit for English, and analyse the difficult issues when code-switched lattice-based structures for subword units are used for tasks involving languages of quite different natures.","PeriodicalId":292194,"journal":{"name":"2009 IEEE Workshop on Automatic Speech Recognition & Understanding","volume":"233 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132959844","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Discriminative Product-of-Expert acoustic mapping for cross-lingual phone recognition","authors":"K. Sim","doi":"10.1109/ASRU.2009.5372910","DOIUrl":"https://doi.org/10.1109/ASRU.2009.5372910","url":null,"abstract":"This paper presents a Product-of-Expert framework to perform probabilistic acoustic mapping for cross-lingual phone recognition. Under this framework, the posterior probabilities of the target HMM states are modelled as the weighted product of experts, where the experts or their weights are modelled as functions of the posterior probabilities of the source HMM states generated by a foreign phone recogniser. Careful choice of these functions leads to the Product-of-Posterior and Posterior Weighted Product-of-Expert models, which can be conveniently represented as 2-layer and 3-layer feed-forward neural networks respectively. Therefore, the commonly used error back-propagation method can be used to discriminatively train the model parameters. Experimental results are presented on the NTIMIT database using the Czech, Hungarian and Russian hybrid NN/HMM recognisers as the foreign phone recognisers to recognise English phones. With only about 15.6 minutes of training data, the best acoustic mapping model achieved 46.00% phone error rate, which is not far behind the 43.55% performance of the NN/HMM system trained directly on the full 3.31 hours of data.","PeriodicalId":292194,"journal":{"name":"2009 IEEE Workshop on Automatic Speech Recognition & Understanding","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116106947","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A study on hidden Markov model's generalization capability for speech recognition","authors":"Xiong Xiao, Jinyu Li, Chng Eng Siong, Haizhou Li, Chin-Hui Lee","doi":"10.1109/ASRU.2009.5373359","DOIUrl":"https://doi.org/10.1109/ASRU.2009.5373359","url":null,"abstract":"From statistical learning theory, the generalization capability of a model is the ability to generalize well on unseen test data which follow the same distribution as the training data. This paper investigates how generalization capability can also improve robustness when testing and training data are from different distributions in the context of speech recognition. Two discriminative training (DT) methods are used to train the hidden Markov model (HMM) for better generalization capability, namely the minimum classification error (MCE) and the soft-margin estimation (SME) methods. Results on Aurora-2 task show that both SME and MCE are effective in improving one of the measures of acoustic model's generalization capability, i.e. the margin of the model, with SME be moderately more effective. In addition, the better generalization capability translates into better robustness of speech recognition performance, even when there is significant mismatch between the training and testing data. We also applied the mean and variance normalization (MVN) to preprocess the data to reduce the training-testing mismatch. After MVN, MCE and SME perform even better as the generalization capability now is more closely related to robustness. The best performance on Aurora-2 is obtained from SME and about 28% relative error rate reduction is achieved over the MVN baseline system. Finally, we also use SME to demonstrate the potential of better generalization capability in improving robustness in more realistic noisy task using the Aurora-3 task, and significant improvements are obtained.","PeriodicalId":292194,"journal":{"name":"2009 IEEE Workshop on Automatic Speech Recognition & Understanding","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123938620","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards the use of inferred cognitive states in language modeling","authors":"Nigel G. Ward, Alejandro Vega","doi":"10.1109/ASRU.2009.5373290","DOIUrl":"https://doi.org/10.1109/ASRU.2009.5373290","url":null,"abstract":"In spoken dialog, speakers are simultaneously engaged in various mental processes, and it seems likely that the word that will be said next depends, to some extent, on the states of these mental processes. Further, these states can be inferred, to some extent, from properties of the speaker's voice as they change from moment to moment. As a illustration of how to apply these ideas in language modeling, we examine volume and speaking rate as predictors of the upcoming word. Combining the information which these provide with a trigram model gave a 2.6% improvement in perplexity.","PeriodicalId":292194,"journal":{"name":"2009 IEEE Workshop on Automatic Speech Recognition & Understanding","volume":"103 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122388794","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Three-layer optimizations for fast GMM computations on GPU-like parallel processors","authors":"Kshitij Gupta, John Douglas Owens","doi":"10.1109/ASRU.2009.5373410","DOIUrl":"https://doi.org/10.1109/ASRU.2009.5373410","url":null,"abstract":"In this paper we focus on optimizing compute and memory-bandwidth-intensive GMM computations for low-end, small-form-factor devices running on GPU-like parallel processors. With special emphasis on tackling the memory bandwidth issue that is exacerbated by a lack of CPU-like caches providing temporal locality on GPU-like parallel processors, we propose modifications to three well-known GMM computation reduction techniques. We find considerable locality at the frame, CI-GMM, and mixture layers of GMM compute, and show how it can be extracted by following a chunk-based technique of processing multiple frames for every load of a GMM. On a 1,000- word, command-and-control, continuous-speech task, we are able to achieve compute and memory bandwidth savings of over 60% and 90% respectively, with some degradation in accuracy, when compared to existing GPU-based fast GMM computation techniques.","PeriodicalId":292194,"journal":{"name":"2009 IEEE Workshop on Automatic Speech Recognition & Understanding","volume":"49 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124772574","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}