{"title":"Protoda: Efficient Transfer Learning for Few-Shot Intent Classification","authors":"Manoj Kumar, Varun Kumar, Hadrien Glaude, Cyprien delichy, Aman Alok, Rahul Gupta","doi":"10.1109/SLT48900.2021.9383495","DOIUrl":"https://doi.org/10.1109/SLT48900.2021.9383495","url":null,"abstract":"Practical sequence classification tasks in natural language processing often suffer from low training data availability for target classes. Recent works towards mitigating this problem have focused on transfer learning using embeddings pre-trained on often unrelated tasks, for instance, language modeling. We adopt an alternative approach by transfer learning on an ensemble of related tasks using prototypical networks under the meta-learning paradigm. Using intent classification as a case study, we demonstrate that increasing variability in training tasks can significantly improve classification performance. Further, we apply data augmentation in conjunction with meta-learning to reduce sampling bias. We make use of a conditional generator for data augmentation that is trained directly using the meta-learning objective and simultaneously with prototypical networks, hence ensuring that data augmentation is customized to the task. We explore augmentation in the sentence embedding space as well as prototypical embedding space. Combining meta-learning with augmentation provides upto 6.49% and 8.53% relative F1-score improvements over the best performing systems in the 5-shot and 10-shot learning, respectively.","PeriodicalId":243211,"journal":{"name":"2021 IEEE Spoken Language Technology Workshop (SLT)","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-01-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127122752","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Transformer Based Deliberation for Two-Pass Speech Recognition","authors":"Ke Hu, Ruoming Pang, Tara N. Sainath, Trevor Strohman","doi":"10.1109/SLT48900.2021.9383497","DOIUrl":"https://doi.org/10.1109/SLT48900.2021.9383497","url":null,"abstract":"Interactive speech recognition systems must generate words quickly while also producing accurate results. Two-pass models excel at these requirements by employing a first-pass decoder that quickly emits words, and a second-pass decoder that requires more context but is more accurate. Previous work has established that a deliberation network can be an effective second-pass model. The model attends to two kinds of inputs at once: encoded audio frames and the hypothesis text from the first-pass model. In this work, we explore using transformer layers instead of long-short term memory (LSTM) layers for deliberation rescoring. In transformer layers, we generalize the \"encoder-decoder\" attention to attend to both encoded audio and first-pass text hypotheses. The output context vectors are then combined by a merger layer. Compared to LSTM-based deliberation, our best transformer deliberation achieves 7% relative word error rate improvements along with a 38% reduction in computation. We also compare against non-deliberation transformer rescoring, and find a 9% relative improvement.","PeriodicalId":243211,"journal":{"name":"2021 IEEE Spoken Language Technology Workshop (SLT)","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-01-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122028501","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improving Speech Recognition Accuracy of Local POI Using Geographical Models","authors":"Songjun Cao, Yike Zhang, Xiaobing Feng, Long Ma","doi":"10.1109/SLT48900.2021.9383538","DOIUrl":"https://doi.org/10.1109/SLT48900.2021.9383538","url":null,"abstract":"Nowadays voice search for points of interest (POI) is becoming increasingly popular. However, speech recognition for local POI names still remains a challenge due to multi-dialect and long-tailed distribution of POI names. This paper improves speech recognition accuracy for local POI from two aspects. Firstly, a geographic acoustic model (Geo-AM) is proposed. The proposed Geo-AM deals with multi-dialect problem using dialect-specific input feature and dialect-specific top layers. Secondly, a group of geo-specific language models (Geo-LMs) are integrated into our speech recognition system to improve recognition accuracy of long-tailed and homophone POI names. During decoding, a specific Geo-LM is selected on-demand according to the user’s geographic location. Experiments show that the proposed Geo-AM achieves 6.5%~10.1% relative character error rate (CER) reduction on an accent test set and the proposed Geo-AM and Geo-LMs totally achieve over 18.7% relative CER reduction on a voice search task for Tencent Map.","PeriodicalId":243211,"journal":{"name":"2021 IEEE Spoken Language Technology Workshop (SLT)","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-01-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129689295","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Investigations on audiovisual emotion recognition in noisy conditions","authors":"M. Neumann, Ngoc Thang Vu","doi":"10.1109/SLT48900.2021.9383588","DOIUrl":"https://doi.org/10.1109/SLT48900.2021.9383588","url":null,"abstract":"In this paper we explore audiovisual emotion recognition under noisy acoustic conditions with a focus on speech features. We attempt to answer the following research questions: (i) How does speech emotion recognition perform on noisy data? and (ii) To what extend does a multimodal approach improve the accuracy and compensate for potential performance degradation at different noise levels?We present an analytical investigation on two emotion datasets with superimposed noise at different signal-to-noise ratios, comparing three types of acoustic features. Visual features are incorporated with a hybrid fusion approach: The first neural network layers are separate modality-specific ones, followed by at least one shared layer before the final prediction. The results show a significant performance decrease when a model trained on clean audio is applied to noisy data and that the addition of visual features alleviates this effect.","PeriodicalId":243211,"journal":{"name":"2021 IEEE Spoken Language Technology Workshop (SLT)","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-01-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126885789","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Cross-Demographic Portability of Deep NLP-Based Depression Models","authors":"T. Rutowski, Elizabeth Shriberg, A. Harati, Yang Lu, R. Oliveira, P. Chlebek","doi":"10.1109/SLT48900.2021.9383609","DOIUrl":"https://doi.org/10.1109/SLT48900.2021.9383609","url":null,"abstract":"Deep learning models are rapidly gaining interest for real-world applications in behavioral health. An important gap in current literature is how well such models generalize over different populations. We study Natural Language Processing (NLP) based models to explore portability over two different corpora highly mismatched in age. The first and larger corpus contains younger speakers. It is used to train an NLP model to predict depression. When testing on unseen speakers from the same age distribution, this model performs at AUC=0.82. We then test this model on the second corpus, which comprises seniors from a retirement community. Despite the large demographic differences in the two corpora, we saw only modest degradation in performance for the senior-corpus data, achieving AUC=0.76. Interestingly, in the senior population, we find AUC=0.81 for the subset of patients whose health state is consistent over time. Implications for demographic portability of speech-based applications are discussed.","PeriodicalId":243211,"journal":{"name":"2021 IEEE Spoken Language Technology Workshop (SLT)","volume":"81 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-01-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121390988","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Personalizing Speech Start Point and End Point Detection in ASR Systems from Speaker Embeddings","authors":"Aditya Jayasimha, Periyasamy Paramasivam","doi":"10.1109/SLT48900.2021.9383516","DOIUrl":"https://doi.org/10.1109/SLT48900.2021.9383516","url":null,"abstract":"Start Point Detection (SPD) and End Point Detection (EPD) in Automatic Speech Recognition (ASR) systems are the tasks of detecting the time at which the user starts speaking and stops speaking respectively. They are crucial problems in ASR as inaccurate detection of SPD and/or EPD leads to poor ASR performance and bad user experience. The main challenge involved in SPD and EPD is accurate detection in noisy environments, especially when speech noise is significant in the background. The current approaches tend to fail to distinguish between the speech of the real user and speech in the background. In this work, we aim to improve SPD and EPD in a multi-speaker environment. We propose a novel approach that personalizes SPD and EPD to a desired user and helps improve ASR quality and latency. We combine user-specific information (i-vectors) with traditional speech features (log-mel) and build a Convolutional, Long Short-Term Memory, Deep Neural Network (CLDNN) model to achieve personalized SPD and EPD. The proposed approach achieves a relative improvement of 46.53% and 11.31% in SPD accuracy, and 27.87% and 5.31% in EPD accuracy at SNR 0 and 5 dB respectively over the standard non-personalized models.","PeriodicalId":243211,"journal":{"name":"2021 IEEE Spoken Language Technology Workshop (SLT)","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-01-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122251503","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Innovative Bert-Based Reranking Language Models for Speech Recognition","authors":"Shih-Hsuan Chiu, Berlin Chen","doi":"10.1109/SLT48900.2021.9383557","DOIUrl":"https://doi.org/10.1109/SLT48900.2021.9383557","url":null,"abstract":"More recently, Bidirectional Encoder Representations from Transformers (BERT) was proposed and has achieved impressive success on many natural language processing (NLP) tasks such as question answering and language understanding, due mainly to its effective pre-training then fine-tuning paradigm as well as strong local contextual modeling ability. In view of the above, this paper presents a novel instantiation of the BERT-based contextualized language models (LMs) for use in reranking of N-best hypotheses produced by automatic speech recognition (ASR). To this end, we frame N-best hypothesis reranking with BERT as a prediction problem, which aims to predict the oracle hypothesis that has the lowest word error rate (WER) given the N-best hypotheses (denoted by PBERT). In particular, we also explore to capitalize on task-specific global topic information in an unsupervised manner to assist PBERT in N-best hypothesis reranking (denoted by TPBERT). Extensive experiments conducted on the AMI benchmark corpus demonstrate the effectiveness and feasibility of our methods in comparison to the conventional autoregressive models like the recurrent neural network (RNN) and a recently proposed method that employed BERT to compute pseudo-log-likelihood (PLL) scores for N-best hypothesis reranking.","PeriodicalId":243211,"journal":{"name":"2021 IEEE Spoken Language Technology Workshop (SLT)","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-01-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125876342","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Unsupervised Acoustic-to-Articulatory Inversion Neural Network Learning Based on Deterministic Policy Gradient","authors":"Hayato Shibata, Mingxin Zhang, T. Shinozaki","doi":"10.1109/SLT48900.2021.9383554","DOIUrl":"https://doi.org/10.1109/SLT48900.2021.9383554","url":null,"abstract":"This paper presents an unsupervised learning method of deep neural networks that perform acoustic-to-articulatory inversion for arbitrary utterances. Conventional unsupervised acoustic-to-articulatory inversion methods are based on the analysis-by-synthesis approach and non-linear optimization algorithms. One limitation is that they require time-consuming iterative optimizations to obtain articulatory parameters for a given target speech segment. Neural networks, after learning their relationship, can obtain these articulatory parameters without an iterative optimization. However, conventional methods need supervised learning and paired acoustic and articulatory samples. We propose a hybrid auto-encoder based unsupervised learning framework for the acoustic-to-articulatory inversion neural networks that can capture context information. The essential point of the framework is making the training effective. We investigate several reinforcement learning algorithms and show the usefulness of the deterministic policy gradient. Experimental results demonstrate that the proposed method can infer articulatory parameters not only for training set segments but also for unseen utterances. Averaged reconstruction errors achieved for open test samples are similar to or even lower than the conventional method that directly optimizes the articulatory parameters in a closed condition.","PeriodicalId":243211,"journal":{"name":"2021 IEEE Spoken Language Technology Workshop (SLT)","volume":"2015 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-01-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128344408","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Embedding Aggregation for Far-Field Speaker Verification with Distributed Microphone Arrays","authors":"Danwei Cai, Ming Li","doi":"10.1109/SLT48900.2021.9383501","DOIUrl":"https://doi.org/10.1109/SLT48900.2021.9383501","url":null,"abstract":"With the successful application of deep speaker embedding networks, the performance of speaker verification systems has significantly improved under clean and close-talking settings; however, unsatisfactory performance persists under noisy and far-field environments. This study aims at improving the performance of far-field speaker verification systems with distributed microphone arrays in the smart home scenario. The proposed learning framework consists of two modules: a deep speaker embedding module and an aggregation module. The former extracts a speaker embedding for each recording. The latter, based on either averaged pooling or attentive pooling, aggregates speaker embeddings and learns a unified representation for all recordings captured by distributed microphone arrays. The two modules are trained in an end-to-end manner. To evaluate this framework, we conduct experiments on the real text-dependent far-field datasets Hi Mia. Results show that our framework outperforms the naive averaged aggregation methods by 20% in terms of equal error rate (EER) with six distributed microphone arrays. Also, we find that the attention-based aggregation advocates high-quality recordings and repels low-quality ones.","PeriodicalId":243211,"journal":{"name":"2021 IEEE Spoken Language Technology Workshop (SLT)","volume":"62 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-01-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130690166","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}