{"title":"A comparison-based approach to mispronunciation detection","authors":"Ann Lee, James R. Glass","doi":"10.1109/SLT.2012.6424254","DOIUrl":"https://doi.org/10.1109/SLT.2012.6424254","url":null,"abstract":"The task of mispronunciation detection for language learning is typically accomplished via automatic speech recognition (ASR). Unfortunately, less than 2% of the world's languages have an ASR capability, and the conventional process of creating an ASR system requires large quantities of expensive, annotated data. In this paper we report on our efforts to develop a comparison-based framework for detecting word-level mispronunciations in nonnative speech. Dynamic time warping (DTW) is carried out between a student's (non-native speaker) utterance and a teacher's (native speaker) utterance, and we focus on extracting word-level and phone-level features that describe the degree of mis-alignment in the warping path and the distance matrix. Experimental results on a Chinese University of Hong Kong (CUHK) nonnative corpus show that the proposed framework improves the relative performance on a mispronounced word detection task by nearly 50% compared to an approach that only considers DTW alignment scores.","PeriodicalId":375378,"journal":{"name":"2012 IEEE Spoken Language Technology Workshop (SLT)","volume":"79 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132535089","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Word segmentation through cross-lingual word-to-phoneme alignment","authors":"Felix Stahlberg, Tim Schlippe, S. Vogel, Tanja Schultz","doi":"10.1109/SLT.2012.6424202","DOIUrl":"https://doi.org/10.1109/SLT.2012.6424202","url":null,"abstract":"We present our new alignment model Model 3P for cross-lingual word-to-phoneme alignment, and show that unsupervised learning of word segmentation is more accurate when information of another language is used. Word segmentation with cross-lingual information is highly relevant to bootstrap pronunciation dictionaries from audio data for Automatic Speech Recognition, bypass the written form in Speech-to-Speech Translation or build the vocabulary of an unseen language, particularly in the context of under-resourced languages. Using Model 3P for the alignment between English words and Spanish phonemes outperforms a state-of-the-art monolingual word segmentation approach [1] on the BTEC corpus [2] by up to 42% absolute in F-Score on the phoneme level and a GIZA++ alignment based on IBM Model 3 by up to 17%.","PeriodicalId":375378,"journal":{"name":"2012 IEEE Spoken Language Technology Workshop (SLT)","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114350942","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A reranking approach for recognition and classification of speech input in conversational dialogue systems","authors":"Fabrizio Morbini, Kartik Audhkhasi, Ron Artstein, Maarten Van Segbroeck, Kenji Sagae, P. Georgiou, D. Traum, Shrikanth S. Narayanan","doi":"10.1109/SLT.2012.6424196","DOIUrl":"https://doi.org/10.1109/SLT.2012.6424196","url":null,"abstract":"We address the challenge of interpreting spoken input in a conversational dialogue system with an approach that aims to exploit the close relationship between the tasks of speech recognition and language understanding through joint modeling of these two tasks. Instead of using a standard pipeline approach where the output of a speech recognizer is the input of a language understanding module, we merge multiple speech recognition and utterance classification hypotheses into one list to be processed by a joint reranking model. We obtain substantially improved performance in language understanding in experiments with thousands of user utterances collected from a deployed spoken dialogue system.","PeriodicalId":375378,"journal":{"name":"2012 IEEE Spoken Language Technology Workshop (SLT)","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125529361","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Speaker diarization and linking of large corpora","authors":"Marc Ferras, Herve Boudard","doi":"10.1109/SLT.2012.6424236","DOIUrl":"https://doi.org/10.1109/SLT.2012.6424236","url":null,"abstract":"Performing speaker diarization of a collection of recordings, where speakers are uniquely identified across the database, is a challenging task. In this context, inter-session variability compensation and reasonable computation times are essential to be addressed. In this paper we propose a two-stage system composed of speaker diarization and speaker linking modules that are able to perform data set wide speaker diarization and that handle both large volumes of data and inter-session variability compensation. The speaker linking system agglomeratively clusters speaker factor posterior distributions, obtained within the Joint Factor Analysis framework, that model the speaker clusters output by a standard speaker diarization system. Therefore, the technique inherently compensates the channel variability effects from recording to recording within the database. A threshold is used to obtain meaningful speaker clusters by cutting the dendrogram obtained by the agglomerative clustering. We show how the Hotteling t-square statistic is an interesting distance measure for this task and input data, obtaining the best results and stability. The system is evaluated using three subsets of the AMI corpus involving different speaker and channel variabilities. We use the within-recording and across-recording diarization error rates (DER), cluster purity and cluster coverage to measure the performance of the proposed system. Across-recording DER as low as within-recording DER are obtained for some system setups.","PeriodicalId":375378,"journal":{"name":"2012 IEEE Spoken Language Technology Workshop (SLT)","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128643104","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Incorporating syllable duration into line-detection-based spoken term detection","authors":"Teppei Ohno, T. Akiba","doi":"10.1109/SLT.2012.6424223","DOIUrl":"https://doi.org/10.1109/SLT.2012.6424223","url":null,"abstract":"A conventional method for spoken term detection (STD) is to apply approximate string matching to subword sequences in a spoken document obtained by speech recognition. An STD method that considers string matching as line detection in a syllable distance plane has been proposed. While this has demonstrated fast ordered-by-distance detections, it has still suffered from the insertion and deletion errors introduced by the speech recognition. In this work, we aim to improve detection performance by employing syllable-duration information. The proposed method enables robust detection by introducing a distance plane that uses frames as units instead of using syllables as units. Our experimental evaluation showed that the incorporation of syllable-duration achieved higher detection performance in high-recall regions.","PeriodicalId":375378,"journal":{"name":"2012 IEEE Spoken Language Technology Workshop (SLT)","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126119673","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A grapheme-based method for automatic alignment of speech and text data","authors":"Adriana Stan, P. Bell, Simon King","doi":"10.1109/SLT.2012.6424237","DOIUrl":"https://doi.org/10.1109/SLT.2012.6424237","url":null,"abstract":"This paper introduces a method for automatic alignment of speech data with unsynchronised, imperfect transcripts, for a domain where no initial acoustic models are available. Using grapheme-based acoustic models, word skip networks and orthographic speech transcripts, we are able to harvest 55% of the speech with a 93% utterance-level accuracy and 99% word accuracy for the produced transcriptions. The work is based on the assumption that there is a high degree of correspondence between the speech and text, and that a full transcription of all of the speech is not required. The method is language independent and the only prior knowledge and resources required are the speech and text transcripts, and a few minor user interventions.","PeriodicalId":375378,"journal":{"name":"2012 IEEE Spoken Language Technology Workshop (SLT)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133588641","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Realistic answer verification: An analysis of user errors in a sentence-repetition task","authors":"S. Shirali-Shahreza, Gerald Penn","doi":"10.1109/SLT.2012.6424163","DOIUrl":"https://doi.org/10.1109/SLT.2012.6424163","url":null,"abstract":"Speech authentication protocols should have a challenge/response feature to be protected against replay attacks. As a result, they need to verify whether the user responded to an interactive prompt. However, it is usually assumed that the user will provide their answer perfectly. In this paper, we report on an ecologically valid user study that we conducted to test this assumption. Our results show that 40% of user answers are imperfect, even in a task as simple as sentence repetition. Error analysis reveals that 60% of the imperfect answers contain small errors that should be deemed acceptable, which increases the total acceptance rate of this task to 84%. We also tested a forced alignment algorithm as a means of verifying answers automatically.","PeriodicalId":375378,"journal":{"name":"2012 IEEE Spoken Language Technology Workshop (SLT)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130490456","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Using rhythmic features for Japanese spoken term detection","authors":"Naoyuki Kanda, Ryu Takeda, Y. Obuchi","doi":"10.1109/SLT.2012.6424217","DOIUrl":"https://doi.org/10.1109/SLT.2012.6424217","url":null,"abstract":"A new rescoring method for spoken term detection (STD) is proposed. Phoneme-based close-matching techniques have been used because of their ability to detect out-of-vocabulary (OOV) queries. To improve the accuracy of phoneme-based techniques, rescoring techniques have been used to accurately re-rank the results from phoneme-based close-matching; however, conventional rescoring techniques based on an utterance verification model still produce many false detection results. To further improve the accuracy, in this study, several features representing the “naturalness” (or “abnormality”) of duration of phonemes/syllables in detected candidates of a keyword are proposed. These features are incorporated into a conventional rescoring technique using logistic regression. Experimental results with a 604-hour Japanese speech corpus indicated that combining the rhythmic features achieved a further relative error reduction of 8.9% compared to a conventional rescoring technique.","PeriodicalId":375378,"journal":{"name":"2012 IEEE Spoken Language Technology Workshop (SLT)","volume":"59 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117036998","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Affective evaluation of a mobile multimodal dialogue system using brain signals","authors":"M. Perakakis, A. Potamianos","doi":"10.1109/SLT.2012.6424195","DOIUrl":"https://doi.org/10.1109/SLT.2012.6424195","url":null,"abstract":"We propose the use of affective metrics such as excitement, frustration and engagement for the evaluation of multimodal dialogue systems. The affective metrics are elicited from the ElectroEncephaloGraphy (EEG) signals using the Emotiv EPOC neuroheadset device. The affective metrics are used in conjunction with traditional evaluation metrics (turn duration, input modality) to investigate the effect of speech recognition errors and modality usage patterns in a multimodal (touch and speech) dialogue form-filling application for the iPhone mobile device. Results show that: (1) engagement is higher for touch input, while excitement and frustration is higher for speech input, and (2) speech recognition errors and associated repairs correspond to specific dynamic patters of excitement and frustration. Use of such physiological channels and their elaborated interpretation is a challenging but also a potentially rewarding direction towards emotional and cognitive assessment of multimodal interaction design.","PeriodicalId":375378,"journal":{"name":"2012 IEEE Spoken Language Technology Workshop (SLT)","volume":"77 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128196296","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Performance improvement of automatic pronunciation assessment in a noisy classroom","authors":"Yi Luan, Masayuki Suzuki, Yutaka Yamauchi, N. Minematsu, Shuhei Kato, K. Hirose","doi":"10.1109/SLT.2012.6424262","DOIUrl":"https://doi.org/10.1109/SLT.2012.6424262","url":null,"abstract":"In recent years Computer-Assisted Language Learning (CALL) systems have been widely used in foreign language education. Some systems use automatic speech recognition (ASR) technologies to detect pronunciation errors and estimate the proficiency level of individual students. When speech recording is done in a CALL classroom, however, utterances of a student are always recorded with those of the others in the same class. The latter utterances are just background noise, and the performance of automatic pronunciation assessment is degraded especially when a student is surrounded with very active students. To solve this problem, we apply a noise reduction technique, Stereo-based Piecewise Linear Compensation for Environments (SPLICE), and the compensated feature sequences are input to a Goodness Of Pronunciation (GOP) assessment system. Results show that SPLICE-based noise reduction works very well as a means to improve the assessment performance in a noisy classroom.","PeriodicalId":375378,"journal":{"name":"2012 IEEE Spoken Language Technology Workshop (SLT)","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133319366","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}