{"title":"Quantitative evaluation of dialog corpora acquired through different techniques","authors":"D. Griol, L. Hurtado, E. Segarra, E. Arnal","doi":"10.1109/SLT.2008.4777851","DOIUrl":"https://doi.org/10.1109/SLT.2008.4777851","url":null,"abstract":"In this paper, we present the results of the comparison between three corpora acquired by means of different techniques. The first corpus was acquired using the Wizard of Oz technique. A statistical user simulation technique has been developed for the acquisition of the second corpus. In this technique, the next user answer is selected by means of a classification process that takes into account the previous user turns, the last system answer and the objective of the dialog. Finally, a dialog simulation technique has been developed for the acquisition of the third corpus. This technique uses a random selection of the user and system turns, defining stop conditions for automatically deciding if the simulated dialog is successful or not. We use several evaluation measures proposed in previous research to compare between our three acquired corpora, and then discuss the similarities and differences with regard to these measures.","PeriodicalId":186876,"journal":{"name":"2008 IEEE Spoken Language Technology Workshop","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134423898","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Real-time speech recognition captioning of events and meetings","authors":"Gilles Boulianne, M. Boisvert, Frédéric Osterrath","doi":"10.1109/SLT.2008.4777874","DOIUrl":"https://doi.org/10.1109/SLT.2008.4777874","url":null,"abstract":"Real-time speech recognition captioning has not progressed much, beyond television broadcast, to other tasks like meetings in the workplace. A number of obstacles prevent this transition, such as proper means to receive and display captions, or on-site shadow speakers costs. More problematic is the insufficient performance of speech recognition for less formal and one-time events. We describe how we developed a mobile platform for remote captioning during trials in several conferences and meetings. We also show that sentence selection based on relative entropy allows training of adequate language models with small amounts of in-domain data, making real-time captioning of an event possible with only a few hours of preparation.","PeriodicalId":186876,"journal":{"name":"2008 IEEE Spoken Language Technology Workshop","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132109688","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Open vocabulary spoken document retrieval by subword sequence obtained from speech recognizer","authors":"Go Kuriki, Y. Itoh, K. Kojima, M. Ishigame, Kazuyo Tanaka, Shi-wook Lee","doi":"10.1109/SLT.2008.4777900","DOIUrl":"https://doi.org/10.1109/SLT.2008.4777900","url":null,"abstract":"We present a method for open vocabulary retrieval based on a spoken document retrieval (SDR) system using subword models. The present paper proposes a new approach to open vocabulary SDR system using subword models which do not require subword recognition. Instead, subword sequences are obtained from the phone sequence outputted containing an out of vocabulary (OOV) word, a speech recognizer outputs a word sequence whose phone sequence is considered to be similar to the OOV word. When OOV words are provided in a query, the proposed system is able to retrieve the target section by comparing the phone sequences of the query and the word sequence generated by the speech recognizer.","PeriodicalId":186876,"journal":{"name":"2008 IEEE Spoken Language Technology Workshop","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123461717","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Automatic title generation for Chinese spoken documents with a delicate scored Viterbi algorithm","authors":"Sheng-yi Kong, Chien-Chih Wang, Ko-chien Kuo, Lin-Shan Lee","doi":"10.1109/SLT.2008.4777866","DOIUrl":"https://doi.org/10.1109/SLT.2008.4777866","url":null,"abstract":"Automatic title generation for spoken documents is believed to be an important key for browsing and navigation over huge quantities of multimedia content. A new framework of automatic title generation for Chinese spoken documents is proposed in this paper using a delicate scored Viterbi algorithm performed over automatically generated text summaries of the testing spoken documents. The Viterbi beam search is guided by a delicate score evaluated from three sets of models: term selection model tells the most suitable terms to be included in the title, term ordering model gives the best ordering of the terms to make the title readable, and title length model tells the reasonable length of the title. The models are trained from a training corpus which is not required to be matched with the testing spoken documents. Both objective evaluation based on F1 measure and subjective human evaluation for relevance and readability indicated the approach is very attractive.","PeriodicalId":186876,"journal":{"name":"2008 IEEE Spoken Language Technology Workshop","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122119711","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Name aware speech-to-speech translation for English/Iraqi","authors":"R. Prasad, C. Moran, F. Choi, R. Meermeier, S. Saleem, C. Kao, D. Stallard, P. Natarajan","doi":"10.1109/SLT.2008.4777887","DOIUrl":"https://doi.org/10.1109/SLT.2008.4777887","url":null,"abstract":"In this paper, we describe a novel approach that exploits intra-sentence and dialog-level context for improving translation performance on spoken Iraqi utterances that contain named entities (NEs). Dialog-level context is used to predict whether the Iraqi response is likely to contain names and the intra-sentence context is used to determine words that are named entities. While we do not address the problem of translating out-of-vocabulary (OOV) NEs in spoken utterances, we show that our approach is capable of translating OOV names in text input. To demonstrate efficacy of our approach, we present results on internal test set as well as the 2008 June DARPA TRANSTAC name evaluation set.","PeriodicalId":186876,"journal":{"name":"2008 IEEE Spoken Language Technology Workshop","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115530123","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Effects of self-disclosure and empathy in human-computer dialogue","authors":"Ryuichiro Higashinaka, Kohji Dohsaka, Hideki Isozaki","doi":"10.1109/SLT.2008.4777852","DOIUrl":"https://doi.org/10.1109/SLT.2008.4777852","url":null,"abstract":"To build trust or cultivate long-term relationships with users, conversational systems need to perform social dialogue. To date, research has primarily focused on the overall effect of social dialogue in human-computer interaction, leading to little work on the effects of individual linguistic phenomena within social dialogue. This paper investigates such individual effects through dialogue experiments. Focusing on self-disclosure and empathic utterances (agreement and disagreement), we empirically calculate their contributions to the dialogue quality. Our analysis shows that (1) empathic utterances by users are strong indicators of increasing closeness and user satisfaction, (2) the system's empathic utterances are effective for inducing empathy from users, and (3) self-disclosure by users increases when users have positive preferences on topics being discussed.","PeriodicalId":186876,"journal":{"name":"2008 IEEE Spoken Language Technology Workshop","volume":"86 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126165988","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"IslEnquirer: Social user model acquisition through network analysis and interactive learning","authors":"F. Putze, H. Holzapfel","doi":"10.1109/SLT.2008.4777854","DOIUrl":"https://doi.org/10.1109/SLT.2008.4777854","url":null,"abstract":"We present an approach to introduce social awareness in interactive systems. The IslEnquirer is a system which automatically builds social user models. It initializes the models by social network analysis of available offline data. These models are then verified and extended by interactive learning which is carried out by a robot initiated spoken dialog with the user.","PeriodicalId":186876,"journal":{"name":"2008 IEEE Spoken Language Technology Workshop","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130043457","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Joint generative and discriminative models for spoken language understanding","authors":"Marco Dinarelli, Alessandro Moschitti, G. Riccardi","doi":"10.1109/SLT.2008.4777840","DOIUrl":"https://doi.org/10.1109/SLT.2008.4777840","url":null,"abstract":"Spoken Language Understanding aims at mapping a natural language spoken sentence into a semantic representation. In the last decade two main approaches have been pursued: generative and discriminative models. The former is more robust to overfitting whereas the latter is more robust to many irrelevant features. Additionally, the way in which these approaches encode prior knowledge is very different and their relative performance changes based on the task. In this paper we describe a training framework where both models are used: a generative model produces a list of ranked hypotheses whereas a discriminative model, depending on string kernels and Support Vector Machines, re-ranks such list. We tested such approach on a new corpus produced in the European LUNA project. The results show a large improvement on the state-of-the-art in concept segmentation and labeling.","PeriodicalId":186876,"journal":{"name":"2008 IEEE Spoken Language Technology Workshop","volume":"451 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133270002","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Discriminative learning using linguistic features to rescore n-best speech hypotheses","authors":"Maria Georgescul, Manny Rayner, P. Bouillon, Nikos Tsourakis","doi":"10.1109/SLT.2008.4777849","DOIUrl":"https://doi.org/10.1109/SLT.2008.4777849","url":null,"abstract":"We describe how we were able to improve the accuracy of a medium-vocabulary spoken dialog system by rescoring the list of n-best recognition hypotheses using a combination of acoustic, syntactic, semantic and discourse information. The non-acoustic features are extracted from different intermediate processing results produced by the natural language processing module, and automatically filtered. We apply discriminative support vector learning designed for re-ranking, using both word error rate and semantic error rate as ranking target value, and evaluating using five-fold cross-validation; to show robustness of our method, confidence intervals for word and semantic error rates are computed via bootstrap sampling. The reduction in semantic error rate, from 19% to 11%, is statistically significant at 0.01 level.","PeriodicalId":186876,"journal":{"name":"2008 IEEE Spoken Language Technology Workshop","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131517301","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A research bed for unit selection based text to speech synthesis","authors":"K. Sarathy, A. Ramakrishnan","doi":"10.1109/SLT.2008.4777882","DOIUrl":"https://doi.org/10.1109/SLT.2008.4777882","url":null,"abstract":"The paper describes a modular, unit selection based TTS framework, which can be used as a research bed for developing TTS in any new language, as well as studying the effect of changing any parameter during synthesis. Using this framework, TTS has been developed for Tamil. Synthesis database consists of 1027 phonetically rich pre-recorded sentences. This framework has already been tested for Kannada. Our TTS synthesizes intelligible and acceptably natural speech, as supported by high mean opinion scores. The framework is further optimized to suit embedded applications like mobiles and PDAs. We compressed the synthesis speech database with standard speech compression algorithms used in commercial GSM phones and evaluated the quality of the resultant synthesized sentences. Even with a highly compressed database, the synthesized output is perceptually close to that with uncompressed database. Through experiments, we explored the ambiguities in human perception when listening to Tamil phones and syllables uttered in isolation, thus proposing to exploit the misperception to substitute for missing phone contexts in the database. Listening experiments have been conducted on sentences synthesized by deliberately replacing phones with their confused ones.","PeriodicalId":186876,"journal":{"name":"2008 IEEE Spoken Language Technology Workshop","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132890803","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}