Transcription of multi-genre media archives using out-of-domain data
Peter Bell, M. Gales, P. Lanchantin, Xunying Liu, Yanhua Long, Steve Renals, P. Swietojanski, P. Woodland
2012 IEEE Spoken Language Technology Workshop (SLT), December 2012. DOI: 10.1109/SLT.2012.6424244
Abstract: We describe our work on developing a speech recognition system for multi-genre media archives. The high diversity of the data makes this a challenging recognition task, which may benefit from systems trained on a combination of in-domain and out-of-domain data. Working with tandem HMMs, we present Multi-level Adaptive Networks (MLAN), a novel technique for incorporating information from out-of-domain posterior features using deep neural networks. We show that it provides a substantial reduction in WER over other systems, with relative WER reductions of 15% over a PLP baseline, 9% over in-domain tandem features and 8% over the best out-of-domain tandem features.
Localized detection of speech recognition errors
Svetlana Stoyanchev, Philipp Salletmayr, Jingbo Yang, Julia Hirschberg
2012 IEEE Spoken Language Technology Workshop (SLT), December 2012. DOI: 10.1109/SLT.2012.6424164
Abstract: We address the problem of localized error detection in Automatic Speech Recognition (ASR) output. Localized error detection seeks to identify which particular words in a user's utterance have been misrecognized. Identifying misrecognized words permits one to create targeted clarification strategies for spoken dialogue systems, allowing the system to ask clarification questions targeting the particular type of misrecognition, in contrast to the "please repeat/rephrase" strategies used in most current dialogue systems. We present results of machine learning experiments using ASR confidence scores together with prosodic and syntactic features to predict 1) whether an utterance contains an error, and 2) whether a word in a misrecognized utterance is itself misrecognized. We show that adding syntactic features to the ASR features when predicting misrecognized utterances improves F-measure by 13.3% compared to using ASR features alone; adding syntactic and prosodic features when predicting misrecognized words improves F-measure by 40%.
{"title":"Frame-based phonotactic Language Identification","authors":"Kyu Jeong Han, Jason W. Pelecanos","doi":"10.1109/SLT.2012.6424240","DOIUrl":"https://doi.org/10.1109/SLT.2012.6424240","url":null,"abstract":"This paper describes a frame-based phonotactic Language Identification (LID) system, which was used for the LID evaluation of the Robust Automatic Transcription of Speech (RATS) program by the Defense Advanced Research Projects Agency (DARPA). The proposed approach utilizes features derived from frame-level phone log-likelihoods from a phone recognizer. It is an attempt to capture not only phone sequence information but also short-term timing information for phone N-gram events, which is lacking in conventional phonotactic LID systems that simply count phone N-gram events. Based on this new method, we achieved 26% relative improvement in terms of Cavg for the RATS LID evaluation data compared to phone N-gram counts modeling. We also observed that it had a significant impact on score combination with our best acoustic system based on Mel-Frequency Cepstral Coefficients (MFCCs).","PeriodicalId":375378,"journal":{"name":"2012 IEEE Spoken Language Technology Workshop (SLT)","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130245453","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
MediaParl: Bilingual mixed language accented speech database
David Imseng, H. Bourlard, Holger Caesar, Philip N. Garner, G. Lecorvé, Alexandre Nanchen
2012 IEEE Spoken Language Technology Workshop (SLT), December 2012. DOI: 10.1109/SLT.2012.6424233
Abstract: MediaParl is a Swiss accented bilingual database containing recordings in both French and German as they are spoken in Switzerland. The data were recorded at the Valais Parliament. Valais is a bilingual Swiss canton with many local accents and dialects. Therefore, the database contains data with high variability and is suitable to study multilingual, accented and non-native speech recognition as well as language identification and language switch detection. We also define monolingual and mixed language automatic speech recognition and language identification tasks and evaluate baseline systems. The database is publicly available for download.
Deep-level acoustic-to-articulatory mapping for DBN-HMM based phone recognition
Leonardo Badino, Claudia Canevari, L. Fadiga, G. Metta
2012 IEEE Spoken Language Technology Workshop (SLT), December 2012. DOI: 10.1109/SLT.2012.6424252
Abstract: In this paper we experiment with methods based on Deep Belief Networks (DBNs) to recover measured articulatory data from speech acoustics. Our acoustic-to-articulatory mapping (AAM) processes go through multi-layered and hierarchical (i.e., deep) representations of the acoustic and the articulatory domains obtained through unsupervised learning of DBNs. The unsupervised learning of DBNs can serve two purposes: (i) pre-training of the Multi-layer Perceptrons that perform AAM; (ii) transformation of the articulatory domain that is recovered from acoustics through AAM. The recovered articulatory features are combined with MFCCs to compute phone posteriors for phone recognition. Tested on the MOCHA-TIMIT corpus, the recovered articulatory features, when combined with MFCCs, lead to up to a 16.6% relative phone error reduction with respect to a phone recognizer that uses only MFCCs.
{"title":"POMDP-based Let's Go system for spoken dialog challenge","authors":"Sungjin Lee, M. Eskénazi","doi":"10.1109/SLT.2012.6424198","DOIUrl":"https://doi.org/10.1109/SLT.2012.6424198","url":null,"abstract":"This paper describes a POMDP-based Let's Go system which incorporates belief tracking and dialog policy optimization into the dialog manager of the reference system for the Spoken Dialog Challenge (SDC). Since all components except for the dialog manager were kept the same, component-wise comparison can be performed to investigate the effect of belief tracking and dialog policy optimization on the overall system performance. In addition, since unsupervised methods have been adopted to learn all required models to reduce human labor and development time, the effectiveness of the unsupervised approaches compared to conventional supervised approaches can be investigated. The result system participated in the 2011 SDC and showed comparable performance with the base system which has been enhanced from the reference system for the 2010 SDC. This shows the capability of the proposed method to rapidly produce an effective system with minimal human labor and experts' knowledge.","PeriodicalId":375378,"journal":{"name":"2012 IEEE Spoken Language Technology Workshop (SLT)","volume":"102 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129354376","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The language-independent bottleneck features
Karel Veselý, M. Karafiát, F. Grézl, M. Janda, E. Egorova
2012 IEEE Spoken Language Technology Workshop (SLT), December 2012. DOI: 10.1109/SLT.2012.6424246
Abstract: In this paper we present a novel language-independent bottleneck (BN) feature extraction framework. In our experiments we used a Multilingual Artificial Neural Network (ANN) in which each language is modelled by a separate output layer, while all the hidden layers jointly model the variability of all the source languages. The key idea is that the entire ANN is trained on all the languages simultaneously, so the BN features are not biased towards any one language. For this reason, the final BN features are considered language-independent. In experiments with the GlobalPhone database, we show that multilingual BN features consistently outperform monolingual BN features. We also evaluate cross-lingual generalization, training on 5 source languages and testing on 3 other languages. The results show that the ANN can produce very good BN features even for unseen languages, in some cases better than if we had trained the ANN on the target language only.
Comparison of adaptation methods for GMM-SVM based speech emotion recognition
Jianbo Jiang, Zhiyong Wu, Mingxing Xu, Jia Jia, Lianhong Cai
2012 IEEE Spoken Language Technology Workshop (SLT), December 2012. DOI: 10.1109/SLT.2012.6424234
Abstract: Utterance length is one of the key factors affecting the performance of automatic emotion recognition. To improve the accuracy of emotion discrimination, adaptation algorithms that can operate on short utterances are essential. This paper therefore compares two classical model adaptation methods, maximum a posteriori (MAP) and maximum likelihood linear regression (MLLR), in GMM-SVM based emotion recognition, and investigates which method performs better for different enrollment utterance lengths. Experimental results show that MLLR adaptation performs better for very short enrollment utterances (shorter than 2 s), while MAP adaptation is more effective for longer utterances.
Use of kernel deep convex networks and end-to-end learning for spoken language understanding
L. Deng, Gökhan Tür, Xiaodong He, Dilek Z. Hakkani-Tür
2012 IEEE Spoken Language Technology Workshop (SLT), December 2012. DOI: 10.1109/SLT.2012.6424224
Abstract: We present our recent and ongoing work on applying deep learning techniques to spoken language understanding (SLU) problems. The previously developed deep convex network (DCN) is extended to its kernel version (K-DCN), in which the number of hidden units in each DCN layer approaches infinity via the kernel trick. We report experimental results demonstrating dramatic error reduction achieved by the K-DCN over both the Boosting-based baseline and the DCN on a domain classification task of SLU, especially when a highly correlated set of features extracted from search-query click logs is used. Not only can DCN and K-DCN be used as a domain or intent classifier for SLU, they can also be used as local, discriminative feature extractors for the slot filling task of SLU. The interface of K-DCN to slot filling systems via the softmax function is presented. Finally, we outline an end-to-end learning strategy for training the softmax parameters (and potentially all DCN and K-DCN parameters) where the learning objective can take any performance measure (e.g. the F-measure) for the full SLU system.
N-best error simulation for training spoken dialogue systems
Blaise Thomson, Milica Gasic, Matthew Henderson, P. Tsiakoulis, S. Young
2012 IEEE Spoken Language Technology Workshop (SLT), December 2012. DOI: 10.1109/SLT.2012.6424194
Abstract: A recent trend in spoken dialogue research is the use of reinforcement learning to train dialogue systems in a simulated environment. Past researchers have shown that the types of errors that are simulated can have a significant effect on simulated dialogue performance. Since modern systems typically receive an N-best list of possible user utterances, it is important to be able to simulate a full N-best list of hypotheses. This paper presents a new method for simulating such errors based on logistic regression, as well as a new method for simulating the structure of N-best lists of semantics and their probabilities, based on the Dirichlet distribution. Off-line evaluations show that the new Dirichlet model results in a much closer match to the receiver operating characteristics (ROC) of the live data. Experiments also show that the logistic model gives confusions that are closer to the type of confusions observed in live situations. The hope is that these new error models will be able to improve the resulting performance of trained dialogue systems.