{"title":"Two extensions to ensemble speaker and speaking environment modeling for robust automatic speech recognition","authors":"Yu Tsao, Chin-Hui Lee","doi":"10.1109/ASRU.2007.4430087","DOIUrl":"https://doi.org/10.1109/ASRU.2007.4430087","url":null,"abstract":"Recently an ensemble speaker and speaking environment modeling (ESSEM) approach to characterizing unknown testing environments was studied for robust speech recognition. Each environment is modeled by a super-vector consisting of the entire set of mean vectors from all Gaussian densities of a set of HMMs for a particular environment. The super-vector for a new testing environment is then obtained by an affine transformation on the ensemble super-vectors. In this paper, we propose a minimum classification error training procedure to obtain discriminative ensemble elements, and a super-vector clustering technique to achieve refined ensemble structures. We test these two extentions to ESSEM on Aurora2. In a per-utterance unsupervised adaptation mode we achieved an average WER of 4.99% from OdB to 20 dB conditions with these two extentions when compared with a 5.51% WER obtained with the ML-trained gender-dependent baseline. To our knowledge this represents the best result reported in the literature on the Aurora2 connected digit recognition task.","PeriodicalId":371729,"journal":{"name":"2007 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU)","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117125164","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A fast-match approach for robust, faster than real-time speaker diarization","authors":"Yan Huang, Oriol Vinyals, G. Friedland, Christian A. Müller, Nikki Mirghafori, Chuck Wooters","doi":"10.1109/ASRU.2007.4430196","DOIUrl":"https://doi.org/10.1109/ASRU.2007.4430196","url":null,"abstract":"During the past few years, speaker diarization has achieved satisfying accuracy in terms of speaker Diarization Error Rate (DER). The most successful approaches, based on agglomerative clustering, however, exhibit an inherent computational complexity which makes real-time processing, especially in combination with further processing steps, almost impossible. In this article we present a framework to speed up agglomerative clustering speaker diarization. The basic idea is to adopt a computationally cheap method to reduce the hypothesis space of the more expensive and accurate model selection via Bayesian Information Criterion (BIC). Two strategies based on the pitch-correlogram and the unscented-trans-form based approximation of KL-divergence are used independently as a fast-match approach to select the most likely clusters to merge. We performed the experiments using the existing ICSI speaker diarization system. The new system using KL-divergence fast-match strategy only performs 14% of total BIC comparisons needed in the baseline system, speeds up the system by 41% without affecting the speaker Diarization Error Rate (DER). The result is a robust and faster than real-time speaker diarization system.","PeriodicalId":371729,"journal":{"name":"2007 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU)","volume":"26 1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132352790","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Automatic lexical pronunciations generation and update","authors":"Ghinwa F. Choueiter, S. Seneff, James R. Glass","doi":"10.1109/ASRU.2007.4430113","DOIUrl":"https://doi.org/10.1109/ASRU.2007.4430113","url":null,"abstract":"Most automatic speech recognizers use a dictionary that maps words to one or more canonical pronunciations. Such entries are typically hand-written by lexical experts. In this research, we investigate a new approach for automatically generating lexical pronunciations using a linguistically motivated subword model, and refining the pronunciations with spoken examples. The approach is evaluated on an isolated word recognition task with a 2 k lexicon of restaurant and street names. A letter-to-sound model is first used to generate seed baseforms for the lexicon. Then spoken utterances of words in the lexicon are presented to a subword recognizer and the top hypotheses are used to update the lexical base-forms. The spelling of each word is also used to constrain the subword search space and generate spelling-constrained baseforms. The results obtained are quite encouraging and indicate that our approach can be successfully used to learn valid pronunciations of new words.","PeriodicalId":371729,"journal":{"name":"2007 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU)","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129859588","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Semantic translation error rate for evaluating translation systems","authors":"Krishna Subramanian, D. Stallard, R. Prasad, S. Saleem, P. Natarajan","doi":"10.1109/ASRU.2007.4430144","DOIUrl":"https://doi.org/10.1109/ASRU.2007.4430144","url":null,"abstract":"In this paper, we introduce a new metric which we call the semantic translation error rate, or STER, for evaluating the performance of machine translation systems. STER is based on the previously published translation error rate (TER) (Snover et al., 2006) and METEOR (Banerjee and Lavie, 2005) metrics. Specifically, STER extends TER in two ways: first, by incorporating word equivalence measures (WordNet and Porter stemming) standardly used by METEOR, and second, by disallowing alignments of concept words to non-concept words (aka stop words). We show how these features make STER alignments better suited for human-driven analysis than standard TER. We also present experimental results that show that STER is better correlated to human judgments than TER. Finally, we compare STER to METEOR, and illustrate that METEOR scores computed using the STER alignments have similar statistical properties to METEOR scores computed using METEOR alignments.","PeriodicalId":371729,"journal":{"name":"2007 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU)","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130480239","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards robust automatic evaluation of pathologic telephone speech","authors":"K. Riedhammer, G. Stemmer, T. Haderlein, M. Schuster, F. Rosanowski, E. Nöth, A. Maier","doi":"10.1109/ASRU.2007.4430200","DOIUrl":"https://doi.org/10.1109/ASRU.2007.4430200","url":null,"abstract":"For many aspects of speech therapy an objective evaluation of the intelligibility of a patient's speech is needed. We investigate the evaluation of the intelligibility of speech by means of automatic speech recognition. Previous studies have shown that measures like word accuracy are consistent with human experts' ratings. To ease the patient's burden, it is highly desirable to conduct the assessment via phone. However, the telephone channel influences the quality of the speech signal which negatively affects the results. To reduce inaccuracies, we propose a combination of two speech recognizers. Experiments on two sets of pathological speech show that the combination results in consistent improvements in the correlation between the automatic evaluation and the ratings by human experts. Furthermore, the approach leads to reductions of 10% and 25% of the maximum error of the intelligibility measure.","PeriodicalId":371729,"journal":{"name":"2007 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU)","volume":"269 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133156696","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Phonological feature based variable frame rate scheme for improved speech recognition","authors":"A. Sangwan, J. Hansen","doi":"10.1109/ASRU.2007.4430177","DOIUrl":"https://doi.org/10.1109/ASRU.2007.4430177","url":null,"abstract":"In this paper, we propose a new scheme for variable frame rate (VFR) feature processing based on high level segmentation (HLS) of speech into broad phone classes. Traditional fixed-rate processing is not capable of accurately reflecting the dynamics of continuous speech. On the other hand, the proposed VFR scheme adapts the temporal representation of the speech signal by tying the framing strategy with the detected phone class sequence. The phone classes are detected and segmented by using appropriately trained phonological features (PFs). In this manner, the proposed scheme is capable of tracking the evolution of speech due to the underlying phonetic content, and exploiting the non-uniform information flow-rate of speech by using a variable framing strategy. The new VFR scheme is applied to automatic speech recognition of TIMIT and NTIMIT corpora, where it is compared to a traditional fixed window-size/frame-rate scheme. Our experiments yield encouraging results with relative reductions of 24% and 8% in WER (word error rate) for TIMIT and NTIMIT tasks, respectively.","PeriodicalId":371729,"journal":{"name":"2007 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU)","volume":"42 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126632039","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A language modeling approach to question answering on speech transcripts","authors":"Matthias H. Heie, E. Whittaker, Josef R. Novak, S. Furui","doi":"10.1109/ASRU.2007.4430112","DOIUrl":"https://doi.org/10.1109/ASRU.2007.4430112","url":null,"abstract":"This paper presents a language modeling approach to sentence retrieval for Question Answering (QA) that we used in Question Answering on speech transcripts (QAst), a pilot task at the Cross Language Evaluation Forum (CLEF) evaluations 2007. A language model (LM) is generated for each sentence and these models are combined with document LMs to take advantage of contextual information. A query expansion technique using class models is proposed and included in our framework. Finally, our method's impact on exact answer extraction is evaluated. We show that combining sentence LMs with document LMs significantly improves sentence retrieval performance, and that this sentence retrieval approach leads to better answer extraction performance.","PeriodicalId":371729,"journal":{"name":"2007 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU)","volume":"292 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121491428","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Call classification for automated troubleshooting on large corpora","authors":"Keelan Evanini, David Suendermann-Oeft, R. Pieraccini","doi":"10.1109/ASRU.2007.4430110","DOIUrl":"https://doi.org/10.1109/ASRU.2007.4430110","url":null,"abstract":"This paper compares six algorithms for call classification in the framework of a dialog system for automated troubleshooting. The comparison is carried out on large datasets, each consisting of over 100,000 utterances from two domains: television (TV) and Internet (INT). In spite of the high number of classes (79 for TV and 58 for INT), the best classifier (maximum entropy on word bigrams) achieved more than 77% classification accuracy on the TV dataset and 81% on the INT dataset.","PeriodicalId":371729,"journal":{"name":"2007 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU)","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121623531","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Combining statistical models with symbolic grammar in parsing","authors":"Junichi Tsujii","doi":"10.1109/ASRU.2007.4430140","DOIUrl":"https://doi.org/10.1109/ASRU.2007.4430140","url":null,"abstract":"There are two streams of research in computational linguistics and natural language processing, the empiricist and rationalist traditions. Theories and computational techniques in these two streams have been developed separately and different in nature. Although the two traditions have been considered irreconcilable and have often been antagonistic toward each other, I have contention with this assertion, and thus claim that these two research streams in linguistics, despite or due to their differences, can be complementary to each other and should be combined into a unified methodology. I will demonstrate in my talk that there have been interesting developments in this direction of integration, and would like to discuss some of the recent results with their implications on engineering application.","PeriodicalId":371729,"journal":{"name":"2007 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU)","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116797752","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Variational Kullback-Leibler divergence for Hidden Markov models","authors":"J. Hershey, P. Olsen, Steven J. Rennie","doi":"10.1109/ASRU.2007.4430132","DOIUrl":"https://doi.org/10.1109/ASRU.2007.4430132","url":null,"abstract":"Divergence measures are widely used tools in statistics and pattern recognition. The Kullback-Leibler (KL) divergence between two hidden Markov models (HMMs) would be particularly useful in the fields of speech and image recognition. Whereas the KL divergence is tractable for many distributions, including Gaussians, it is not in general tractable for mixture models or HMMs. Recently, variational approximations have been introduced to efficiently compute the KL divergence and Bhattacharyya divergence between two mixture models, by reducing them to the divergences between the mixture components. Here we generalize these techniques to approach the divergence between HMMs using a recursive backward algorithm. Two such methods are introduced, one of which yields an upper bound on the KL divergence, the other of which yields a recursive closed-form solution. The KL and Bhattacharyya divergences, as well as a weighted edit-distance technique, are evaluated for the task of predicting the confusability of pairs of words.","PeriodicalId":371729,"journal":{"name":"2007 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2007-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124385471","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}