{"title":"Relationship between dialogue acts and hot spots in meetings","authors":"B. Wrede, Elizabeth Shriberg","doi":"10.1109/ASRU.2003.1318425","DOIUrl":"https://doi.org/10.1109/ASRU.2003.1318425","url":null,"abstract":"We examine the relationship between hot spots (annotated in terms of involvement) and dialogue acts (DAs, annotated in an independent effort) in roughly 32 hours of speech data from naturally-occurring meetings. Results reveal that four independently-motivated involvement categories (non-involved, disagreeing, amused, and other) show statistically significant associations with particular DAs. Further examination shows that involvement is associated with contextual features (such as the speaker or type of meeting), as well as with lexical features (such as utterance length and perplexity). Finally, we found (surprisingly) that perplexities are similar for involved and non-involved utterances. This suggests that it may not be the amount of propositional content, but rather participants' attitudes toward that content, that differentiates hot spots from other regions in a meeting. Overall, these specific correlations, and their relationships to other features, such as perplexity, could provide useful information for the automatic archiving and browsing of natural meetings.","PeriodicalId":394174,"journal":{"name":"2003 IEEE Workshop on Automatic Speech Recognition and Understanding (IEEE Cat. No.03EX721)","volume":"67 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2003-11-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127143009","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A noise-robust ASR front-end using Wiener filter constructed from MMSE estimation of clean speech and noise","authors":"Jian Wu, J. Droppo, L. Deng, A. Acero","doi":"10.1109/ASRU.2003.1318461","DOIUrl":"https://doi.org/10.1109/ASRU.2003.1318461","url":null,"abstract":"In this paper, we present a novel two-stage framework for designing a noise-robust front-end for automatic speech recognition. In the first stage, a parametric model of acoustic distortion is used to estimate the clean speech and noise spectra in a principled way so that no heuristic parameters need to be set manually. To reduce possible flaws caused by the simplifying assumptions in the parametric model, a second-stage Wiener filtering is applied to further reduce the noise while preserving speech spectra unharmed. This front-end is evaluated on the Aurora2 task. For the multi-condition training scenario, a relative error reduction of 28.4% is achieved.","PeriodicalId":394174,"journal":{"name":"2003 IEEE Workshop on Automatic Speech Recognition and Understanding (IEEE Cat. No.03EX721)","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2003-11-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122475451","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Balancing data-driven and rule-based approaches in the context of a Multimodal Conversational System","authors":"S. Bangalore, Michael Johnston","doi":"10.1109/ASRU.2003.1318444","DOIUrl":"https://doi.org/10.1109/ASRU.2003.1318444","url":null,"abstract":"We address the issue of combining data-driven and grammar-based models for rapid prototyping of a multimodal conversational system. Moderate-sized rule-based spoken language models for recognition and understanding are easy to develop and provide the ability to prototype conversational applications rapidly. However, scalability of such systems is a bottleneck due to the heavy cost of authoring and maintenance of rule sets and inevitable brittleness due to lack of coverage in the rule sets. In contrast, data-driven approaches are robust and the procedure for model building is usually simple. However, the lack of data in an application context limits the ability to build data-driven models, especially in multimodal systems. We also present methods that reuse data from different domains and investigate the limits of such models in the context of an application domain.","PeriodicalId":394174,"journal":{"name":"2003 IEEE Workshop on Automatic Speech Recognition and Understanding (IEEE Cat. No.03EX721)","volume":"51 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2003-11-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129567463","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"In search of optimal data selection for training of automatic speech recognition systems","authors":"A. Nagroski, L. Boves, H. Steeneken","doi":"10.1109/ASRU.2003.1318405","DOIUrl":"https://doi.org/10.1109/ASRU.2003.1318405","url":null,"abstract":"This paper presents an extended study in the topic of optimal selection of speech data from a database for efficient training of ASR systems. We reconsider a method of optimal selection introduced in our previous work and introduce variosearch as an alternative selection method developed in order to find a representative sample of speech data with a simultaneous control of acoustical and statistical parameters of data selected. Next, we present experiments in which the performance of a standard ASR system trained with data sets selected from a Dutch digits database via different selection methods was compared. The results show that the length of training utterances has a dominant impact on the recognition performance. Therefore, the length of the utterances is a factor that must be taken into account when interpreting phoneme recognition scores.","PeriodicalId":394174,"journal":{"name":"2003 IEEE Workshop on Automatic Speech Recognition and Understanding (IEEE Cat. No.03EX721)","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2003-11-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128578190","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Data collection and evaluation of AURORA-2 Japanese corpus [speech recognition applications]","authors":"Satoshi Nakamura, Kazumasa Yamamoto, K. Takeda, S. Kuroiwa, N. Kitaoka, Takeshi Yamada, M. Mizumachi, T. Nishiura, M. Fujimoto, A. Saso, Toshiki Endo","doi":"10.1109/ASRU.2003.1318511","DOIUrl":"https://doi.org/10.1109/ASRU.2003.1318511","url":null,"abstract":"Speech recognition systems must still be improved when they are exposed to noisy environments. For this improvement, developments of the standard evaluation corpus and assessment technologies are essential. Recently, the AURORA-2,3 corpus and their evaluation scenarios have had significant impact on noisy speech recognition research. This paper introduces a Japanese noisy speech corpus and its evaluation scripts, called AURORA-2J The AURORA-2J is a Japanese connected digits corpus. The data collection and evaluation scenarios are designed in the same way as AURORA-2 with the help of the ETSI AURORA group. Furthermore, we have collected an in-car speech corpus similar to AURORA-3. The in-car speech corpus includes Japanese connected digits and command words collected in a moving car. This paper describes the data collection, baseline scripts, and its baseline performance.","PeriodicalId":394174,"journal":{"name":"2003 IEEE Workshop on Automatic Speech Recognition and Understanding (IEEE Cat. No.03EX721)","volume":"73 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2003-11-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123901523","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Interactive grammar inference with finite state transducers","authors":"S. Caskey, Ezra Story, R. Pieraccini","doi":"10.1109/ASRU.2003.1318503","DOIUrl":"https://doi.org/10.1109/ASRU.2003.1318503","url":null,"abstract":"We propose a method for improving the coverage of handcrafted context free grammars based on a set of new sentence examples. The described algorithm aims at finding the minimal set of modifications to the grammar that increase its coverage while preserving its original structure. The algorithm is based on a finite state transducer (FST) representation of context free grammars. The inference method includes an interactive component that allows developers to control the generalization of the new grammar.","PeriodicalId":394174,"journal":{"name":"2003 IEEE Workshop on Automatic Speech Recognition and Understanding (IEEE Cat. No.03EX721)","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2003-11-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128049685","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Pronunciation modeling for names of foreign origin","authors":"B. Maison, S.F. Chen, P. S. Cohen","doi":"10.1109/ASRU.2003.1318479","DOIUrl":"https://doi.org/10.1109/ASRU.2003.1318479","url":null,"abstract":"The pronunciation of a proper name is influenced by both a speaker's native language as well as the language of origin of the name itself. Thus, creating suitable sets of pronunciations for names in speech recognition applications is extremely challenging. We investigate whether automatic language identification and grapheme-to-phoneme conversion algorithms can be effective for this task. We train grapheme-to-phoneme models for eight foreign languages and use automatic language identification to select the models with which to generate additional pronunciations for words in a baseline pronunciation dictionary. As compared to the baseline dictionary in a US name recognition task, we achieve a 25% reduction in sentence-error rate for foreign names spoken by native speakers of the language in question, and a 10% reduction in sentence-error rate for foreign names spoken by American speakers.","PeriodicalId":394174,"journal":{"name":"2003 IEEE Workshop on Automatic Speech Recognition and Understanding (IEEE Cat. No.03EX721)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2003-11-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130435564","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improved language model adaptation using existing and derived external resources","authors":"Pi-Chuan Chang, Lin-Shan Lee","doi":"10.1109/ASRU.2003.1318496","DOIUrl":"https://doi.org/10.1109/ASRU.2003.1318496","url":null,"abstract":"The adaptation of language models to obtain better parameters for the topics addressed by the spoken documents to be recognized has been a key issue for speech recognition. In this paper, we propose to collect existing as well as derived external resources for improved language model adaptation. The derived external resources are those retrieved, based on the baseline transcriptions for the input spoken documents, from the Internet using a search engine. The design of queries for such purposes is also analyzed in this paper, in which the special structure of the Chinese language is considered. The obtained existing and derived external resources are then used in the model adaptation, under a clustering-classification framework. Very encouraging results were obtained in the preliminary experiments with two test sets: broadcast news and interview recording.","PeriodicalId":394174,"journal":{"name":"2003 IEEE Workshop on Automatic Speech Recognition and Understanding (IEEE Cat. No.03EX721)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2003-11-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134123951","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Belief confirmation in spoken dialog systems using confidence measures","authors":"C. Raymond, Y. Estève, F. Béchet, R. de Mori, Géraldine Damnati","doi":"10.1109/ASRU.2003.1318420","DOIUrl":"https://doi.org/10.1109/ASRU.2003.1318420","url":null,"abstract":"The approach proposed is an alternative to the traditional architecture of spoken dialogue systems where the system belief is either not taken into account during the automatic speech recognition process or included in the decoding process but never challenged. By representing all the conceptual structures handled by the dialogue manager by finite state machines and by building a conceptual model that contains all the possible interpretations of a given word-graph, we propose a decoding architecture that searches first for the best conceptual interpretation before looking for the best string of words. Once both N-best sets (at the concept level and at the word level) are generated, a verification process is performed on each N-best set using acoustic and linguistic confidence measures. A first selection strategy that does not include for the moment the dialogue context is proposed and significant error reduction on the understanding measures are obtained.","PeriodicalId":394174,"journal":{"name":"2003 IEEE Workshop on Automatic Speech Recognition and Understanding (IEEE Cat. No.03EX721)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2003-11-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134620807","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"VTLN-based cross-language voice conversion","authors":"D. Sündermann, H. Ney, H. Höge","doi":"10.1109/ASRU.2003.1318521","DOIUrl":"https://doi.org/10.1109/ASRU.2003.1318521","url":null,"abstract":"In speech recognition, vocal tract length normalization (VTLN) is a well-studied technique for speaker normalization. As cross-language voice conversion aims at the transformation of a source speaker's voice into that of a target speaker using a different language, we want to investigate whether VTLN is an appropriate method to adapt the voice characteristics. After applying several conventional VTLN warping functions, we extend the conventional piece-wise linear function to several segments, allowing a more detailed warping of the source spectrum. Experiments on cross-language voice conversion are performed on three corpora of two languages and both speaker genders.","PeriodicalId":394174,"journal":{"name":"2003 IEEE Workshop on Automatic Speech Recognition and Understanding (IEEE Cat. No.03EX721)","volume":"49 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2003-11-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117146414","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}