Yun-Nung (Vivian) Chen, William Yang Wang, Alexander I. Rudnicky
{"title":"Unsupervised induction and filling of semantic slots for spoken dialogue systems using frame-semantic parsing","authors":"Yun-Nung (Vivian) Chen, William Yang Wang, Alexander I. Rudnicky","doi":"10.1109/ASRU.2013.6707716","DOIUrl":"https://doi.org/10.1109/ASRU.2013.6707716","url":null,"abstract":"Spoken dialogue systems typically use predefined semantic slots to parse users' natural language inputs into unified semantic representations. To define the slots, domain experts and professional annotators are often involved, and the cost can be expensive. In this paper, we ask the following question: given a collection of unlabeled raw audios, can we use the frame semantics theory to automatically induce and fill the semantic slots in an unsupervised fashion? To do this, we propose the use of a state-of-the-art frame-semantic parser, and a spectral clustering based slot ranking model that adapts the generic output of the parser to the target semantic space. Empirical experiments on a real-world spoken dialogue dataset show that the automatically induced semantic slots are in line with the reference slots created by domain experts: we observe a mean averaged precision of 69.36% using ASR-transcribed data. Our slot filling evaluations also indicate the promising future of this proposed approach.","PeriodicalId":265258,"journal":{"name":"2013 IEEE Workshop on Automatic Speech Recognition and Understanding","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130285528","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jonas Gehring, Quoc Bao Nguyen, Florian Metze, A. Waibel
{"title":"DNN acoustic modeling with modular multi-lingual feature extraction networks","authors":"Jonas Gehring, Quoc Bao Nguyen, Florian Metze, A. Waibel","doi":"10.1109/ASRU.2013.6707754","DOIUrl":"https://doi.org/10.1109/ASRU.2013.6707754","url":null,"abstract":"In this work, we propose several deep neural network architectures that are able to leverage data from multiple languages. Modularity is achieved by training networks for extracting high-level features and for estimating phoneme state posteriors separately, and then combining them for decoding in a hybrid DNN/HMM setup. This approach has been shown to achieve superior performance for single-language systems, and here we demonstrate that feature extractors benefit significantly from being trained as multi-lingual networks with shared hidden representations. We also show that existing mono-lingual networks can be re-used in a modular fashion to achieve a similar level of performance without having to train new networks on multi-lingual data. Furthermore, we investigate in extending these architectures to make use of language-specific acoustic features. Evaluations are performed on a low-resource conversational telephone speech transcription task in Vietnamese, while additional data for acoustic model training is provided in Pashto, Tagalog, Turkish, and Cantonese. Improvements of up to 17.4% and 13.8% over mono-lingual GMMs and DNNs, respectively, are obtained.","PeriodicalId":265258,"journal":{"name":"2013 IEEE Workshop on Automatic Speech Recognition and Understanding","volume":"52 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125431380","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Keith D. Levin, Katharine Henry, A. Jansen, Karen Livescu
{"title":"Fixed-dimensional acoustic embeddings of variable-length segments in low-resource settings","authors":"Keith D. Levin, Katharine Henry, A. Jansen, Karen Livescu","doi":"10.1109/ASRU.2013.6707765","DOIUrl":"https://doi.org/10.1109/ASRU.2013.6707765","url":null,"abstract":"Measures of acoustic similarity between words or other units are critical for segmental exemplar-based acoustic models, spoken term discovery, and query-by-example search. Dynamic time warping (DTW) alignment cost has been the most commonly used measure, but it has well-known inadequacies. Some recently proposed alternatives require large amounts of training data. In the interest of finding more efficient, accurate, and low-resource alternatives, we consider the problem of embedding speech segments of arbitrary length into fixed-dimensional spaces in which simple distances (such as cosine or Euclidean) serve as a proxy for linguistically meaningful (phonetic, lexical, etc.) dissimilarities. Such embeddings would enable efficient audio indexing and permit application of standard distance learning techniques to segmental acoustic modeling. In this paper, we explore several supervised and unsupervised approaches to this problem and evaluate them on an acoustic word discrimination task. We identify several embedding algorithms that match or improve upon the DTW baseline in low-resource settings.","PeriodicalId":265258,"journal":{"name":"2013 IEEE Workshop on Automatic Speech Recognition and Understanding","volume":"645 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120876196","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Semi-supervised training of Deep Neural Networks","authors":"Karel Veselý, M. Hannemann, L. Burget","doi":"10.1109/ASRU.2013.6707741","DOIUrl":"https://doi.org/10.1109/ASRU.2013.6707741","url":null,"abstract":"In this paper we search for an optimal strategy for semi-supervised Deep Neural Network (DNN) training. We assume that a small part of the data is transcribed, while the majority of the data is untranscribed. We explore self-training strategies with data selection based on both the utterance-level and frame-level confidences. Further on, we study the interactions between semi-supervised frame-discriminative training and sequence-discriminative sMBR training. We found it beneficial to reduce the disproportion in amounts of transcribed and untranscribed data by including the transcribed data several times, as well as to do a frame-selection based on per-frame confidences derived from confusion in a lattice. For the experiments, we used the Limited language pack condition for the Surprise language task (Vietnamese) from the IARPA Babel program. The absolute Word Error Rate (WER) improvement for frame cross-entropy training is 2.2%, this corresponds to WER recovery of 36% when compared to the identical system, where the DNN is built on the fully transcribed data.","PeriodicalId":265258,"journal":{"name":"2013 IEEE Workshop on Automatic Speech Recognition and Understanding","volume":"46 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128043740","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Effective pseudo-relevance feedback for language modeling in speech recognition","authors":"Berlin Chen, Yi-Wen Chen, Kuan-Yu Chen, E. Jan","doi":"10.1109/ASRU.2013.6707698","DOIUrl":"https://doi.org/10.1109/ASRU.2013.6707698","url":null,"abstract":"A part and parcel of any automatic speech recognition (ASR) system is language modeling (LM), which helps to constrain the acoustic analysis, guide the search through multiple candidate word strings, and quantify the acceptability of the final output hypothesis given an input utterance. Despite the fact that the n-gram model remains the predominant one, a number of novel and ingenious LM methods have been developed to complement or be used in place of the n-gram model. A more recent line of research is to leverage information cues gleaned from pseudo-relevance feedback (PRF) to derive an utterance-regularized language model for complementing the n-gram model. This paper presents a continuation of this general line of research and its main contribution is two-fold. First, we explore an alternative and more efficient formulation to construct such an utterance-regularized language model for ASR. Second, the utilities of various utterance-regularized language models are analyzed and compared extensively. Empirical experiments on a large vocabulary continuous speech recognition (LVCSR) task demonstrate that our proposed language models can offer substantial improvements over the baseline n-gram system, and achieve performance competitive to, or better than, some state-of-the-art language models.","PeriodicalId":265258,"journal":{"name":"2013 IEEE Workshop on Automatic Speech Recognition and Understanding","volume":"215 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131620207","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Semantic entity detection from multiple ASR hypotheses within the WFST framework","authors":"J. Svec, P. Ircing, L. Smídl","doi":"10.1109/ASRU.2013.6707710","DOIUrl":"https://doi.org/10.1109/ASRU.2013.6707710","url":null,"abstract":"The paper presents a novel approach to named entity detection from ASR lattices. Since the described method not only detects the named entities but also assigns a detailed semantic interpretation to them, we call our approach the semantic entity detection. All the algorithms are designed to use automata operations defined within the framework of weighted finite state transducers (WFST) - the ASR lattices are nowadays frequently represented as weighted acceptors. The expert knowledge about the semantics of the task at hand can be first expressed in the form of a context free grammar and then converted to the FST form. We use a WFST optimization to obtain compact representation of the ASR lattice. The WFST framework also allows to use the word confusion networks as another representation of multiple ASR hypotheses. That way we can use the full power of composition and optimization operations implemented in the OpenFST toolkit for our semantic entity detection algorithm. The devised method also employs the concept of a factor automaton; this approach allows us to overcome the need for a filler model and consequently makes the method more general. The paper includes experimental evaluation of the proposed algorithm and compares the performance obtained by using the one-best word hypothesis, optimized lattices and word confusion networks.","PeriodicalId":265258,"journal":{"name":"2013 IEEE Workshop on Automatic Speech Recognition and Understanding","volume":"172 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124045880","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Learning better lexical properties for recurrent OOV words","authors":"Longlu Qin, Alexander I. Rudnicky","doi":"10.1109/ASRU.2013.6707699","DOIUrl":"https://doi.org/10.1109/ASRU.2013.6707699","url":null,"abstract":"Out-of-vocabulary (OOV) words can appear more than once in a conversation or over a period of time. Such multiple instances of the same OOV word provide valuable information for learning the lexical properties of the word. Therefore, we investigated how to estimate better pronunciation, spelling and part-of-speech (POS) label for recurrent OOV words. We first identified recurrent OOV words from the output of a hybrid decoder by applying a bottom-up clustering approach. Then, multiple instances of the same OOV word were used simultaneously to learn properties of the OOV word. The experimental results showed that the bottom-up clustering approach is very effective at detecting the recurrence of OOV words. Furthermore, by using evidence from multiple instances of the same word, the pronunciation accuracy, recovery rate and POS label accuracy of recurrent OOV words can be substantially improved.","PeriodicalId":265258,"journal":{"name":"2013 IEEE Workshop on Automatic Speech Recognition and Understanding","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117076308","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Compact acoustic modeling based on acoustic manifold using a mixture of factor analyzers","authors":"Wenlin Zhang, Bi-cheng Li, Weiqiang Zhang","doi":"10.1109/ASRU.2013.6707702","DOIUrl":"https://doi.org/10.1109/ASRU.2013.6707702","url":null,"abstract":"A compact acoustic model for speech recognition is proposed based on nonlinear manifold modeling of the acoustic feature space. Acoustic features of the speech signal is assumed to form a low-dimensional manifold, which is modeled by a mixture of factor analyzers. Each factor analyzer describes a local area of the manifold using a low-dimensional linear model. For an HMM-based speech recognition system, observations of a particular state are constrained to be located on part of the manifold, which may cover several factor analyzers. For each tied-state, a sparse weight vector is obtained through an iteration shrinkage algorithm, in which the sparseness is determined automatically by the training data. For each nonzero component of the weight vector, a low-dimensional factor is estimated for the corresponding factor model according to the maximum a posteriori (MAP) criterion, resulting in a compact state model. Experimental results show that compared with the conventional HMM-GMM system and the SGMM system, the new method not only contains fewer parameters, but also yields better recognition results.","PeriodicalId":265258,"journal":{"name":"2013 IEEE Workshop on Automatic Speech Recognition and Understanding","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129363371","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Roger Hsiao, Tim Ng, F. Grézl, D. Karakos, S. Tsakalidis, L. Nguyen, R. Schwartz
{"title":"Discriminative semi-supervised training for keyword search in low resource languages","authors":"Roger Hsiao, Tim Ng, F. Grézl, D. Karakos, S. Tsakalidis, L. Nguyen, R. Schwartz","doi":"10.1109/ASRU.2013.6707770","DOIUrl":"https://doi.org/10.1109/ASRU.2013.6707770","url":null,"abstract":"In this paper, we investigate semi-supervised training for low resource languages where the initial systems may have high error rate (≥ 70.0% word eror rate). To handle the lack of data, we study semi-supervised techniques including data selection, data weighting, discriminative training and multilayer perceptron learning to improve system performance. The entire suite of semi-supervised methods presented in this paper was evaluated under the IARPA Babel program for the keyword spotting tasks. Our semi-supervised system had the best performance in the OpenKWS13 surprise language evaluation for the limited condition. In this paper, we describe our work on the Turkish and Vietnamese systems.","PeriodicalId":265258,"journal":{"name":"2013 IEEE Workshop on Automatic Speech Recognition and Understanding","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127426312","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
M. Saraçlar, A. Sethy, B. Ramabhadran, L. Mangu, Jia Cui, Xiaodong Cui, Brian Kingsbury, Jonathan Mamou
{"title":"An empirical study of confusion modeling in keyword search for low resource languages","authors":"M. Saraçlar, A. Sethy, B. Ramabhadran, L. Mangu, Jia Cui, Xiaodong Cui, Brian Kingsbury, Jonathan Mamou","doi":"10.1109/ASRU.2013.6707774","DOIUrl":"https://doi.org/10.1109/ASRU.2013.6707774","url":null,"abstract":"Keyword search, in the context of low resource languages, has emerged as a key area of research. The dominant approach in keyword search is to use Automatic Speech Recognition (ASR) as a front end to produce a representation of audio that can be indexed. The biggest drawback of this approach lies in its the inability to deal with out-of-vocabulary words and query terms that are not in the ASR system output. In this paper we present an empirical study evaluating various approaches based on using confusion models as query expansion techniques to address this problem. We present results across four languages using a range of confusion models which lead to significant improvements in keyword search performance as measured by the Maximum Term Weighted Value (MTWV) metric.","PeriodicalId":265258,"journal":{"name":"2013 IEEE Workshop on Automatic Speech Recognition and Understanding","volume":"113 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124150874","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}