{"title":"Progress towards speech models that model speech","authors":"Martin Russell","doi":"10.1109/ASRU.1997.658995","DOIUrl":"https://doi.org/10.1109/ASRU.1997.658995","url":null,"abstract":"This paper presents a personal view of recent advances in automatic speech recognition. The analysis is concerned with progress in speech pattern modelling, rather than recogniser performance. Despite the limitations of current approaches, it is argued that extension and development of these techniques provides a viable way forward. It is further suggested that the significance of a number of recent developments, such as sub-band speech recognition and segment modelling, is primarily in their potential for overcoming fundamental limitations of current HMM-based approaches, and not in the short-term improvement in recognition accuracy which has been achieved.","PeriodicalId":253278,"journal":{"name":"1997 IEEE Workshop on Automatic Speech Recognition and Understanding Proceedings","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-12-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121252731","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Learning dialogue strategies within the Markov decision process framework","authors":"E. Levin, R. Pieraccini, W. Eckert","doi":"10.1109/ASRU.1997.658989","DOIUrl":"https://doi.org/10.1109/ASRU.1997.658989","url":null,"abstract":"We introduce a stochastic model for dialogue systems based on the Markov decision process. Within this framework we show that the problem of dialogue strategy design can be stated as an optimization problem, and solved by a variety of methods, including the reinforcement learning approach. The advantages of this new paradigm include objective evaluation of dialogue systems and their automatic design and adaptation. We show some preliminary results on learning a dialogue strategy for an air travel information system.","PeriodicalId":253278,"journal":{"name":"1997 IEEE Workshop on Automatic Speech Recognition and Understanding Proceedings","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-12-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125041186","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A post-processing system to yield reduced word error rates: Recognizer Output Voting Error Reduction (ROVER)","authors":"J. Fiscus","doi":"10.1109/ASRU.1997.659110","DOIUrl":"https://doi.org/10.1109/ASRU.1997.659110","url":null,"abstract":"Describes a system developed at NIST to produce a composite automatic speech recognition (ASR) system output when the outputs of multiple ASR systems are available, and for which, in many cases, the composite ASR output has a lower error rate than any of the individual systems. The system implements a \"voting\" or rescoring process to reconcile differences in ASR system outputs. We refer to this system as the NIST Recognizer Output Voting Error Reduction (ROVER) system. As additional knowledge sources are added to an ASR system (e.g. acoustic and language models), error rates are typically decreased. This paper describes a post-recognition process which models the output generated by multiple ASR systems as independent knowledge sources that can be combined and used to generate an output with reduced error rate. To accomplish this, the outputs of multiple ASR systems are combined into a single, minimal-cost word transition network (WTN) via iterative applications of dynamic programming (DP) alignments. The resulting network is searched by an automatic rescoring or \"voting\" process that selects the output sequence with the lowest score.","PeriodicalId":253278,"journal":{"name":"1997 IEEE Workshop on Automatic Speech Recognition and Understanding Proceedings","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-12-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130407240","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Stream derivation and clustering scheme for subspace distribution clustering hidden Markov model","authors":"Brian Mak, Enrico Bocchieri, Etienne Barnard","doi":"10.1109/ASRU.1997.659109","DOIUrl":"https://doi.org/10.1109/ASRU.1997.659109","url":null,"abstract":"Bocchieri and Mak (Proc. Eurospeech, vol. 1, p. 107-10, 1997) introduced a novel subspace distribution clustering hidden Markov model (SDCHMM) as an approximation to a continuous-density HMM (CDHMM). Deriving SDCHMMs from CDHMMs requires a definition of multiple streams and a Gaussian clustering scheme. Previously, we have tried 4 and 13 streams, which are common but ad hoc choices. In this paper, we present a simple and coherent definition for streams of any dimension: the streams comprise the most correlated features. The new definition is shown to give better performance in two speech recognition tasks. The clustering scheme of Bocchieri and Mak is an O(n^2) algorithm which can be slow when the number of Gaussians in the original CDHMMs is large. Now, we have devised a modified k-means clustering scheme using the Bhattacharyya distance as the distance measure between Gaussian clusters. Not only is the new clustering scheme faster but, when combined with the new stream definitions, we now obtain SDCHMMs which perform at least as well as the original CDHMMs (with better results in some cases).","PeriodicalId":253278,"journal":{"name":"1997 IEEE Workshop on Automatic Speech Recognition and Understanding Proceedings","volume":"29 11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-12-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123162329","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Pronunciation modelling for conversational speech recognition: a status report from WS97","authors":"B. Byrne, M. Finke, S. Khudanpur, J. McDonough, H. Nock, M. Riley, M. Saraçlar, Chuck Wooters, G. Zavaliagkos","doi":"10.1109/ASRU.1997.658973","DOIUrl":"https://doi.org/10.1109/ASRU.1997.658973","url":null,"abstract":"Accurate modelling of pronunciation variability in conversational speech is an important component of automatic speech recognition. We describe some of the projects undertaken in this direction at WS97 [the Fifth LVCSR (large-vocabulary conversational speech recognition) Summer Workshop], held at Johns Hopkins University, Baltimore, in July-August 1997. We first illustrate a use of hand-labelled phonetic transcriptions of a portion of the Switchboard corpus, in conjunction with statistical techniques, to learn alternatives to canonical pronunciations of words. We then describe the use of these alternative pronunciations in a recognition experiment as well as in the acoustic training of an automatic speech recognition system. Our results show a reduction of the word error rate in both cases: 0.9% without acoustic retraining and 2.2% with acoustic retraining.","PeriodicalId":253278,"journal":{"name":"1997 IEEE Workshop on Automatic Speech Recognition and Understanding Proceedings","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-12-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122450741","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Phonetically adaptive cepstrum mean normalization for acoustic mismatch compensation","authors":"M. Morishima, T. Isobe, J. Takahashi","doi":"10.1109/ASRU.1997.659121","DOIUrl":"https://doi.org/10.1109/ASRU.1997.659121","url":null,"abstract":"We propose a new technique that compensates for an acoustic mismatch. This technique is simple and can estimate the acoustic mismatch more accurately than conventional cepstrum mean normalization (CMN), because it takes into consideration the kind of phonemes and their frequency, and can calculate the acoustic mismatch in detail. In this procedure the acoustic mismatch can be estimated as the difference between the centroid vector of distorted speech and that of acoustic models. The cepstral mean of distorted speech is the centroid vector including the distortion. The centroid vector calculated from parameters of acoustic models is regarded as the centroid vector when the distorted speech is assumed to be clean speech. The acoustic models used for calculation are for phonemes that appear in the transcription of the speech. This technique achieves a high word error reduction rate of 73% for ordinary analog telephone speech and 70% for wireless telephone handset speech.","PeriodicalId":253278,"journal":{"name":"1997 IEEE Workshop on Automatic Speech Recognition and Understanding Proceedings","volume":"59 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-12-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126279023","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Synergistic modalities for human/machine communication","authors":"J. Flanagan","doi":"10.1109/ASRU.1997.658967","DOIUrl":"https://doi.org/10.1109/ASRU.1997.658967","url":null,"abstract":"Natural communication with machines is a crucial factor in bringing the benefits of networked computers to mass markets. In particular, the sensory dimensions of sight, sound and touch are comfortable and convenient modalities for the human user. New technologies are now emerging in these domains that can support human/machine communication with features that emulate face-to-face interaction. A current challenge is how to integrate the, as yet, imperfect technologies to achieve synergies that transcend the benefit of a single modality. Because speech is a preferred means for human information exchange, conversational interaction with machines will play a central role in collaborative knowledge work mediated by networked computers. Utilizing speech in combination with simultaneous visual gestures and haptic signalling requires software agents that are able to fuse the error-susceptible sensory information into reliable interpretations that are responsive to (and anticipatory of) human user intentions. This report draws a perspective on research in human/machine communication technologies aimed at supporting computer conferencing and collaborative problem solving.","PeriodicalId":253278,"journal":{"name":"1997 IEEE Workshop on Automatic Speech Recognition and Understanding Proceedings","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-12-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125302442","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A statistical language modeling approach integrating local and global constraints","authors":"J. Bellegarda","doi":"10.1109/ASRU.1997.659014","DOIUrl":"https://doi.org/10.1109/ASRU.1997.659014","url":null,"abstract":"A new framework is proposed to integrate the various constraints, both local and global, that are present in language. Local constraints are captured via n-gram language modeling, while global constraints are taken into account through the use of latent semantic analysis. An integrative formulation is derived for the combination of these two paradigms, resulting in several families of multi-span language models for large-vocabulary speech recognition. Because of the inherent complementarity in the two types of constraints, the performance of the integrated language models, as measured by perplexity, compares favorably with the corresponding n-gram performance.","PeriodicalId":253278,"journal":{"name":"1997 IEEE Workshop on Automatic Speech Recognition and Understanding Proceedings","volume":"22 2","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-12-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131687241","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Variable threshold vector quantization for reduced continuous density likelihood computation in speech recognition","authors":"S. Herman, R.A. Sukkar","doi":"10.1109/ASRU.1997.659108","DOIUrl":"https://doi.org/10.1109/ASRU.1997.659108","url":null,"abstract":"Vector quantization (VQ) has been explored in the past as a means of achieving reductions in likelihood computation for hidden Markov models (HMMs) which use Gaussian mixtures for their output densities. In this paper, we present a new method for choosing which mixtures can be discarded for each pair of HMM state and vector quantization index. Traditionally, a global threshold was used to specify the maximum distance a mixture mean could lie from a VQ codeword before being considered negligible in likelihood calculations for observation vectors contained in that VQ cell. Our technique uses a threshold which varies with VQ cell volume. Thus, larger cells are allocated more mixtures than smaller cells, in order to provide a more uniform coverage of the acoustic space and thereby improve computational efficiency.","PeriodicalId":253278,"journal":{"name":"1997 IEEE Workshop on Automatic Speech Recognition and Understanding Proceedings","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-12-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132574118","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A tonotopic artificial neural network architecture for phoneme probability estimation","authors":"N. Strom","doi":"10.1109/ASRU.1997.659000","DOIUrl":"https://doi.org/10.1109/ASRU.1997.659000","url":null,"abstract":"A novel sparse ANN connection scheme is proposed. It is inspired by the so-called tonotopic organization of the auditory nerve, and allows a more detailed representation of the speech spectrum to be input to an ANN than is commonly used. A consequence of the new connection scheme is that more resources are allocated to analysis within narrow frequency sub-bands, a concept that has recently been investigated by others with so-called sub-band ASR. ANNs with the proposed architecture have been evaluated on the TIMIT database for phoneme recognition, and are found to give better phoneme recognition performance than ANNs based on standard mel frequency cepstrum input. The lowest achieved phone error rate, 26.7%, is very close to the lowest published result for the core test set of the TIMIT database.","PeriodicalId":253278,"journal":{"name":"1997 IEEE Workshop on Automatic Speech Recognition and Understanding Proceedings","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-12-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131786094","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}