{"title":"Scalable multi-domain dialogue state tracking","authors":"Abhinav Rastogi, Dilek Z. Hakkani-Tür, Larry Heck","doi":"10.1109/ASRU.2017.8268986","DOIUrl":"https://doi.org/10.1109/ASRU.2017.8268986","url":null,"abstract":"Dialogue state tracking (DST) is a key component of task-oriented dialogue systems. DST estimates the user's goal at each user turn given the interaction until then. State of the art approaches for state tracking rely on deep learning methods, and represent dialogue state as a distribution over all possible slot values for each slot present in the ontology. Such a representation is not scalable when the set of possible values are unbounded (e.g., date, time or location) or dynamic (e.g., movies or usernames). Furthermore, training of such models requires labeled data, where each user turn is annotated with the dialogue state, which makes building models for new domains challenging. In this paper, we present a scalable multi-domain deep learning based approach for DST. We introduce a novel framework for state tracking which is independent of the slot value set, and represent the dialogue state as a distribution over a set of values of interest (candidate set) derived from the dialogue history or knowledge. Restricting these candidate sets to be bounded in size addresses the problem of slot-scalability. Furthermore, by leveraging the slot-independent architecture and transfer learning, we show that our proposed approach facilitates quick adaptation to new domains.","PeriodicalId":290868,"journal":{"name":"2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121649087","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Consistent DNN uncertainty training and decoding for robust ASR","authors":"K. Nathwani, E. Vincent, I. Illina","doi":"10.1109/ASRU.2017.8268934","DOIUrl":"https://doi.org/10.1109/ASRU.2017.8268934","url":null,"abstract":"We consider the problem of robust automatic speech recognition (ASR) in noisy conditions. The performance improvement brought by speech enhancement is often limited by residual distortions of the enhanced features, which can be seen as a form of statistical uncertainty. Uncertainty estimation and propagation methods have recently been proposed to improve the ASR performance with deep neural network (DNN) acoustic models. However, the performance is still limited due to the use of uncertainty only during decoding. In this paper, we propose a consistent approach to account for uncertainty in the enhanced features during both training and decoding. We estimate the variance of the distortions using a DNN uncertainty estimator that operates directly in the feature maximum likelihood linear regression (fMLLR) domain and we then sample the uncertain features using the unscented transform (UT). We report the resulting ASR performance on the CHiME-2 and CHiME-3 datasets for different uncertainty estimation/propagation techniques. The proposed DNN uncertainty training method brings 4% and 8% relative improvement on these two datasets, respectively, compared to a competitive fMLLR-domain DNN acoustic modeling baseline.","PeriodicalId":290868,"journal":{"name":"2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"408 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124334924","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Topic segmentation in ASR transcripts using bidirectional RNNS for change detection","authors":"I. Sheikh, D. Fohr, I. Illina","doi":"10.1109/ASRU.2017.8268979","DOIUrl":"https://doi.org/10.1109/ASRU.2017.8268979","url":null,"abstract":"Topic segmentation methods are mostly based on the idea of lexical cohesion, in which lexical distributions are analysed across the document and segment boundaries are marked in areas of low cohesion. We propose a novel approach for topic segmentation in speech recognition transcripts by measuring lexical cohesion using bidirectional Recurrent Neural Networks (RNN). The bidirectional RNNs capture context in the past and the following set of words. The past and following contexts are compared to perform topic change detection. In contrast to existing works based on sequence and discriminative models for topic segmentation, our approach does not use a segmented corpus nor (pseudo) topic labels for training. Our model is trained using news articles obtained from the internet. Evaluation on ASR transcripts of French TV broadcast news programs demonstrates the effectiveness of our proposed approach.","PeriodicalId":290868,"journal":{"name":"2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"119 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116383065","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Cracking the cocktail party problem by multi-beam deep attractor network","authors":"Zhuo Chen, Jinyu Li, Xiong Xiao, Takuya Yoshioka, Huaming Wang, Zhenghao Wang, Y. Gong","doi":"10.1109/ASRU.2017.8268969","DOIUrl":"https://doi.org/10.1109/ASRU.2017.8268969","url":null,"abstract":"While recent progresses in neural network approaches to singlechannel speech separation, or more generally the cocktail party problem, achieved significant improvement, their performance for complex mixtures is still not satisfactory. In this work, we propose a novel multi-channel framework for multi-talker separation. In the proposed model, an input multi-channel mixture signal is firstly converted to a set of beamformed signals using fixed beam patterns. For this beamforming, we propose to use differential beamformers as they are more suitable for speech separation. Then each beamformed signal is fed into a single-channel anchored deep attractor network to generate separated signals. And the final separation is acquired by post selecting the separating output for each beams. To evaluate the proposed system, we create a challenging dataset comprising mixtures of 2, 3 or 4 speakers. Our results show that the proposed system largely improves the state of the art in speech separation, achieving 11.5 dB, 11.76 dB and 11.02 dB average signal-to-distortion ratio improvement for 4, 3 and 2 overlapped speaker mixtures, which is comparable to the performance of a minimum variance distortionless response beamformer that uses oracle location, source, and noise information. We also run speech recognition with a clean trained acoustic model on the separated speech, achieving relative word error rate (WER) reduction of 45.76%, 59.40% and 62.80% on fully overlapped speech of 4, 3 and 2 speakers, respectively. With a far talk acoustic model, the WER is further reduced.","PeriodicalId":290868,"journal":{"name":"2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115267642","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ONENET: Joint domain, intent, slot prediction for spoken language understanding","authors":"Young-Bum Kim, Sungjin Lee, K. Stratos","doi":"10.1109/ASRU.2017.8268984","DOIUrl":"https://doi.org/10.1109/ASRU.2017.8268984","url":null,"abstract":"In practice, most spoken language understanding systems process user input in a pipelined manner; first domain is predicted, then intent and semantic slots are inferred according to the semantic frames of the predicted domain. The pipeline approach, however, has some disadvantages: error propagation and lack of information sharing. To address these issues, we present a unified neural network that jointly performs domain, intent, and slot predictions. Our approach adopts a principled architecture for multitask learning to fold in the state-of-the-art models for each task. With a few more ingredients, e.g. orthography-sensitive input encoding and curriculum training, our model delivered significant improvements in all three tasks across all domains over strong baselines, including one using oracle prediction for domain detection, on real user data of a commercial personal assistant.","PeriodicalId":290868,"journal":{"name":"2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128350841","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Unsupervised adaptation of student DNNS learned from teacher RNNS for improved ASR performance","authors":"Lahiru Samarakoon, B. Mak","doi":"10.1109/ASRU.2017.8268936","DOIUrl":"https://doi.org/10.1109/ASRU.2017.8268936","url":null,"abstract":"In automatic speech recognition (ASR), adaptation techniques are used to minimize the mismatch between training and testing conditions. Many successful techniques have been proposed for deep neural network (DNN) acoustic model (AM) adaptation. Recently, recurrent neural networks (RNNs) have outperformed DNNs in ASR tasks. However, the adaptation of RNN AMs is challenging and in some cases when combined with adaptation, DNN AMs outperform adapted RNN AMs. In this paper, we combine student-teacher training and unsupervised adaptation to improve ASR performance. First, RNNs are used as teachers to train student DNNs. Then, these student DNNs are adapted in an unsupervised fashion. Experimental results on the AMI IHM and AMI SDM tasks show that student DNNs are adaptable with significant performance improvements for both frame-wise and sequentially trained systems. We also show that the combination of adapted DNNs with teacher RNNs can further improve the performance.","PeriodicalId":290868,"journal":{"name":"2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130666546","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A context-aware speech recognition and understanding system for air traffic control domain","authors":"Youssef Oualil, D. Klakow, György Szaszák, A. Srinivasamurthy, H. Helmke, P. Motlícek","doi":"10.1109/ASRU.2017.8268964","DOIUrl":"https://doi.org/10.1109/ASRU.2017.8268964","url":null,"abstract":"Automatic Speech Recognition and Understanding (ASRU) systems can generally use temporal and situational context information to improve their performance for a given task. This is typically done by rescoring the ASR hypotheses or by dynamically adapting the ASR models. For some domains, such as Air Traffic Control (ATC), this context information can be, however, small in size, partial and available only as abstract concepts (e.g. airline codes), which are difficult to map into full possible spoken sentences to perform rescoring or adaptation. This paper presents a multi-modal ASRU system, which dynamically integrates partial temporal and situational ATC context information to improve its performance. This is done either by 1) extracting word sequences which carry relevant ATC information from ASR N-best Lists and then perform a context-based rescoring on the extracted ATC segments or 2) by a partial adaptation of the language model. Experiments conducted on 4 hours of test data from Prague and Vienna approach (arrivals) showed a relative reduction of the ATC command error rate metric by 30% to 50%.","PeriodicalId":290868,"journal":{"name":"2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134354261","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improving native language (L1) identifation with better VAD and TDNN trained separately on native and non-native English corpora","authors":"Yao Qian, Keelan Evanini, P. Lange, Robert A. Pugh, Rutuja Ubale, F. Soong","doi":"10.1109/ASRU.2017.8268992","DOIUrl":"https://doi.org/10.1109/ASRU.2017.8268992","url":null,"abstract":"Identifying a speaker's native language (L1), i.e., mother tongue, based upon non-native English (L2) speech input, is both challenging and useful for many human-machine voice interface applications, e.g., computer assisted language learning (CALL). In this paper, we improve our sub-phone TDNN based i-vector approach to L1 recognition with a more accurate TDNN-derived VAD and a highly discriminative classifier. Two TDNNs are separately trained on native and non-native English, LVCSR corpora, for contrasting their corresponding sub-phone posteriors and resultant supervectors. The derived i-vectors are then exploited for improving the performance further. Experimental results on a database of 25 L1s show a 3.1% identification rate improvement, from 78.7% to 81.8%, compared with a high performance baseline system which has already achieved the best published results on the 2016 ComParE corpus of only 11 L1s. The statistical analysis of the features used in our system provides useful findings, e.g. pronunciation similarity among the non-native English speakers with different L1s, for research on second-language (L2) learning and assessment.","PeriodicalId":290868,"journal":{"name":"2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"41 1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133022402","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The iFLYTEK system for blizzard machine learning challenge 2017-ES1","authors":"Li-Juan Liu, Chuang Ding, Ya-Jun Hu, Zhenhua Ling, Yuan Jiang, M. Zhou, Si Wei","doi":"10.1109/ASRU.2017.8268999","DOIUrl":"https://doi.org/10.1109/ASRU.2017.8268999","url":null,"abstract":"This paper introduces the speech synthesis system submitted by IFLYTEK for the Blizzard Machine Learning Challenge 2017-ES1. Linguistic and acoustic features from a 4hour corpus were released for this task. Participants are expected to build a speech synthesis system on the given linguist and acoustic features without using any external data. Our system is composed of a long short term memory (LSTM) recurrent neural network (RNN)-based acoustic model and a generative adversarial network (GAN)-based post-filter for mel-cepstra. Two approaches to build GAN-based post-filter are implemented and compared in our experiments. The first one is to predict the residuals of mel-cepstra given the mel-cepstra predicted by the LSTM-based acoustic model. However, this method leads to unstable synthetic speech sounds in our experiments, which may be due to the poor quality of analysis-synthesis speech using the natural acoustic features given by this corpus. The other approach is to ignore the detailed components of natural mel-cepstra by dimension reduction using principal component analysis (PCA) and then recover them back using GAN given the main PCA components. At synthesis time, mel-cepstra predicted by the RNN acoustic model are first projected to the main PCA components, which are then sent to the GAN for detail recovering. Finally, the second approach is used in the final submitted system. The evaluation results show the effectiveness of our submitted system.","PeriodicalId":290868,"journal":{"name":"2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"71 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133746917","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Exploring ASR-free end-to-end modeling to improve spoken language understanding in a cloud-based dialog system","authors":"Yao Qian, Rutuja Ubale, Vikram Ramanarayanan, P. Lange, David Suendermann-Oeft, Keelan Evanini, Eugene Tsuprun","doi":"10.1109/ASRU.2017.8268987","DOIUrl":"https://doi.org/10.1109/ASRU.2017.8268987","url":null,"abstract":"Spoken language understanding (SLU) in dialog systems is generally performed using a natural language understanding (NLU) model based on the hypotheses produced by an automatic speech recognition (ASR) system. However, when new spoken dialog applications are built from scratch in real user environments that often have sub-optimal audio characteristics, ASR performance can suffer due to factors such as the paucity of training data or a mismatch between the training and test data. To address this issue, this paper proposes an ASR-free, end-to-end (E2E) modeling approach to SLU for a cloud-based, modular spoken dialog system (SDS). We evaluate the effectiveness of our approach on crowdsourced data collected from non-native English speakers interacting with a conversational language learning application. Experimental results show that our approach is particularly promising in situations with low ASR accuracy. It can further improve the performance of a sophisticated CNN-based SLU system with more accurate ASR hypotheses by fusing the scores from E2E system, i.e., the overall accuracy of SLU is improved from 85.6% to 86.5%.","PeriodicalId":290868,"journal":{"name":"2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129536798","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}