{"title":"Automatic evaluation methods of a speech translation system's capability","authors":"F. Sugaya, K. Yasuda, T. Takezawa, S. Yamamoto","doi":"10.1109/ASRU.2001.1034661","DOIUrl":"https://doi.org/10.1109/ASRU.2001.1034661","url":null,"abstract":"The main goal of the paper is to propose automatic schemes for the translation paired comparison method, which was proposed by the authors to evaluate precisely a speech translation system's capability. In the method, the outputs of the speech translation system are subjectively compared with the results of native Japanese speakers taking the Test of English for International Communication (TOEIC), which is used as a measure of a person's speech translation capability. Experiments are conducted on TDMT, which is a subsystem of the Japanese-to-English speech translation system ATR-MATRIX developed at ATR Interpreting Telecommunications Research Laboratories. The winning rate of TDMT shows a good correlation with the TOEIC scores of the examinees. A regression analysis of the subjective results shows that the translation capability of TDMT matches that of a person scoring around 700 on the TOEIC. The automatic evaluation methods use DP-based similarity, which is calculated from DP distances between a translation output and multiple translation answers. The answers are collected by two methods: paraphrasing and querying a parallel corpus. With both types of collection, the similarity shows the same good correlation with the examinees' TOEIC scores as the subjective winning rate. Regression analysis using the similarity shows that the system's matched point is around 750. We also show the effects of paraphrased data.","PeriodicalId":118671,"journal":{"name":"IEEE Workshop on Automatic Speech Recognition and Understanding, 2001. ASRU '01.","volume":"95 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2001-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134325812","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Vocabulary independent speech recognition using particles","authors":"E. Whittaker, J.M. Van Thong, P. Moreno","doi":"10.1109/ASRU.2001.1034650","DOIUrl":"https://doi.org/10.1109/ASRU.2001.1034650","url":null,"abstract":"A method is presented for performing speech recognition that does not depend on a fixed word vocabulary. Particles are used as the recognition units in a speech recognition system, which permits word-vocabulary-independent speech decoding. A particle represents a concatenated phone sequence. Each string of particles that represents a word in the one-best hypothesis from the particle speech recognizer is expanded into a list of phonetically similar word candidates using a phone confusion matrix. The resulting word graph is then re-decoded using a word language model to produce the final word hypothesis. Preliminary results on the DARPA HUB4 97 and 98 evaluation sets using word bigram redecoding of the particle hypothesis show a WER between 2.2% and 2.9% higher than that of a word bigram speech recognizer of comparable complexity. The method has potential applications in spoken document retrieval for recovering out-of-vocabulary words and also in client-server based speech recognition.","PeriodicalId":118671,"journal":{"name":"IEEE Workshop on Automatic Speech Recognition and Understanding, 2001. ASRU '01.","volume":"232 2","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2001-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134326254","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Piecewise-linear transformation-based HMM adaptation for noisy speech","authors":"Zhipeng Zhang, S. Furui","doi":"10.1109/ASRU.2001.1034612","DOIUrl":"https://doi.org/10.1109/ASRU.2001.1034612","url":null,"abstract":"This paper proposes a new method using a piecewise-linear transformation for adapting phone HMMs to noisy speech. Various noises are clustered according to their acoustic properties and signal-to-noise ratios (SNR), and a noisy speech HMM corresponding to each clustered noise is made. Based on the likelihood maximization criterion, the HMM which best matches the input speech is selected and further adapted using a linear transformation. The proposed method was evaluated by recognizing noisy broadcast-news speech. It was confirmed that the proposed method was effective in recognizing both artificially noise-added speech and actual noisy speech by a wide range of speakers under various noise conditions.","PeriodicalId":118671,"journal":{"name":"IEEE Workshop on Automatic Speech Recognition and Understanding, 2001. ASRU '01.","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2001-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133129858","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Estimated rank pruning and Java-based speech recognition","authors":"N. Jevtic, A. Klautau, A. Orlitsky","doi":"10.1109/ASRU.2001.1034669","DOIUrl":"https://doi.org/10.1109/ASRU.2001.1034669","url":null,"abstract":"Most speech recognition systems search through large finite state machines to find the most likely path, or hypothesis. Efficient search in these large spaces requires pruning of some hypotheses. Popular pruning techniques include probability pruning which keeps only hypotheses whose probability falls within a prescribed factor from the most likely one, and rank pruning which keeps only a prescribed number of the most probable hypotheses. Rank pruning provides better control over memory use and search complexity, but it requires sorting of the hypotheses, a time consuming task that may slow the recognition process. We propose a pruning technique which combines the advantages of probability and rank pruning. Its time complexity is similar to that of probability pruning and its search-space size, memory consumption, and recognition accuracy are comparable to those of rank pruning. We also describe a research-motivated Java-based speech recognition system that is being built at UCSD.","PeriodicalId":118671,"journal":{"name":"IEEE Workshop on Automatic Speech Recognition and Understanding, 2001. ASRU '01.","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2001-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114401006","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Finite-state transducers for speech-input translation","authors":"F. Casacuberta","doi":"10.1109/ASRU.2001.1034664","DOIUrl":"https://doi.org/10.1109/ASRU.2001.1034664","url":null,"abstract":"Nowadays, hidden Markov models (HMMs) and n-grams are the basic components of the most successful speech recognition systems. In such systems, HMMs (the acoustic models) are integrated into an n-gram or a stochastic finite-state grammar (the language model). Similar models can be used for speech translation, and HMMs (the acoustic models) can be integrated into a finite-state transducer (the translation model). Moreover, the translation process can be performed by searching for an optimal path of states in the integrated network. The output of this search process is a target word sequence associated with the optimal path. In speech translation, HMMs can be trained from a source speech corpus, and the translation model can be learned automatically from a parallel training corpus. This approach has been assessed in the framework of the EUTRANS project, funded by the European Union. Extensive speech-input experiments have been carried out with translations from Spanish to English and from Italian to English, in an application involving the interaction (by telephone) of a customer with a receptionist at the front desk of a hotel. A summary of the most relevant results is presented in this paper.","PeriodicalId":118671,"journal":{"name":"IEEE Workshop on Automatic Speech Recognition and Understanding, 2001. ASRU '01.","volume":"76 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2001-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115720114","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Trend tying in the segmental-feature HMM","authors":"Young-Sun Yun","doi":"10.1109/ASRU.2001.1034585","DOIUrl":"https://doi.org/10.1109/ASRU.2001.1034585","url":null,"abstract":"We present a method for reducing the number of parameters in a segmental-feature HMM (SFHMM). Although the SFHMM shows better results than the CHMM, its number of parameters is greater than that of the CHMM. Therefore, there is a need for a new approach that reduces the number of parameters. In general, a trajectory can be separated into a trend and a location. Since the trend represents the variation of segmental features and accounts for a large portion of the SFHMM's parameters, sharing the trend can decrease the number of parameters of the SFHMM. The proposed method shares the trend part of trajectories by quantization. Experiments are performed on the TIMIT corpus to examine the effectiveness of trend tying. The experimental results show that its performance is almost the same as that of previous studies. To obtain better results with a small number of parameters, the various conditions for the trajectory components must be considered.","PeriodicalId":118671,"journal":{"name":"IEEE Workshop on Automatic Speech Recognition and Understanding, 2001. ASRU '01.","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2001-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114223034","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multiple time resolutions for derivatives of Mel-frequency cepstral coefficients","authors":"G. Stemmer, C. Hacker, E. Noth, H. Niemann","doi":"10.1109/ASRU.2001.1034583","DOIUrl":"https://doi.org/10.1109/ASRU.2001.1034583","url":null,"abstract":"Most speech recognition systems are based on Mel-frequency cepstral coefficients and their first- and second-order derivatives. The derivatives are normally approximated by fitting a linear regression line to a fixed-length segment of consecutive frames. The time resolution and smoothness of the estimated derivative depend on the length of the segment. We present an approach to improve the representation of speech dynamics, which is based on the combination of multiple time resolutions. The resulting feature vector is transformed to reduce its dimension and the correlation between the features. Another possibility, which has also been evaluated, is to use probabilistic PCA (PPCA) for the output distributions of the HMMs. Different configurations of multiple time resolutions are evaluated as well. When compared to the baseline system, a significant reduction of the word error rate can be achieved.","PeriodicalId":118671,"journal":{"name":"IEEE Workshop on Automatic Speech Recognition and Understanding, 2001. ASRU '01.","volume":"170 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2001-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125935493","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Construction of model-space constraints","authors":"Patrick Nguyen, Luca Rigazio, C. Wellekens, J. Junqua","doi":"10.1109/ASRU.2001.1034591","DOIUrl":"https://doi.org/10.1109/ASRU.2001.1034591","url":null,"abstract":"HMM systems exhibit a large amount of redundancy. Exploiting this, a technique called eigenvoices has been found to be very effective for speaker adaptation. The correlation between HMM parameters is exploited via a linear constraint called the eigenspace. This constraint is obtained through a PCA of the training speakers. We show how PCA can be linked to the maximum-likelihood criterion. Then, we extend the method to LDA transformations and piecewise linear constraints. On the Wall Street Journal (WSJ) dictation task, we obtain a 1.7% WER improvement (15% relative) when using self-adaptation.","PeriodicalId":118671,"journal":{"name":"IEEE Workshop on Automatic Speech Recognition and Understanding, 2001. ASRU '01.","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2001-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125155123","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Histogram based normalization in the acoustic feature space","authors":"S. Molau, Michael Pitz, H. Ney","doi":"10.1109/ASRU.2001.1034579","DOIUrl":"https://doi.org/10.1109/ASRU.2001.1034579","url":null,"abstract":"We describe a technique called histogram normalization that aims at normalizing feature space distributions at different stages in the signal analysis front-end, namely the log-compressed filterbank vectors, cepstrum coefficients, and LDA (linear discriminant analysis) transformed acoustic vectors. Best results are obtained at the filterbank stage, and in most cases there is a minor additional gain when normalization is applied sequentially at different stages. We show that histogram normalization performs best if applied both in training and recognition, and that smoothing the target histogram obtained on the training data is also helpful. On the VerbMobil II corpus, a German large-vocabulary conversational speech recognition task, we achieve an overall reduction in word error rate of about 10% relative.","PeriodicalId":118671,"journal":{"name":"IEEE Workshop on Automatic Speech Recognition and Understanding, 2001. ASRU '01.","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2001-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123443284","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ASR in portable wireless devices","authors":"Olli Viikki","doi":"10.1109/ASRU.2001.1034597","DOIUrl":"https://doi.org/10.1109/ASRU.2001.1034597","url":null,"abstract":"This paper discusses the applicability and role of automatic speech recognition in portable wireless devices. Due to the author's background, the viewpoints are somewhat biased towards mobile telephones, but many of the aspects are nevertheless common to other portable devices as well. While the field is still dominated by speaker-dependent technology, there are signs today that ASR in wireless devices is also moving towards speaker-independent systems. As these modern communication devices are usually intended for mass markets, the paper reviews the ASR areas that are relevant for speech recognition on low-cost embedded systems. In particular, multilingual ASR, low-complexity ASR algorithms and their implementation, and acoustic model adaptation techniques play a key role in enabling cost-effective realization of ASR systems. Low complexity and advanced noise robustness are sometimes conflicting goals for ASR algorithms. The paper also briefly reviews some of the most important noise robust ASR techniques that are well suited for embedded systems.","PeriodicalId":118671,"journal":{"name":"IEEE Workshop on Automatic Speech Recognition and Understanding, 2001. ASRU '01.","volume":"148 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2001-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131501424","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}