{"title":"Using Paralinguistic Information to Disambiguate User Intentions for Distinguishing Phrase Structure and Sarcasm in Spoken Dialog Systems","authors":"Zhengyu Zhou, I. G. Choi, Yongliang He, Vikas Yadav, Chin-Hui Lee","doi":"10.1109/SLT48900.2021.9383505","DOIUrl":"https://doi.org/10.1109/SLT48900.2021.9383505","url":null,"abstract":"This paper aims at utilizing paralinguistic information usually hidden in speech signals, such as pitch, short pause and sarcasm, to disambiguate user intention not easily distinguishable from speech recognition and natural language understanding results provided by a state-of-the-art spoken dialog system (SDS). We propose two methods to address the ambiguities in understanding name entities and sentence structures based on relevant speech cues and nuances. We also propose an approach to capturing sarcasm in speech and generating sarcasm-sensitive responses using an end-to-end neural network. An SDS prototype that directly feeds signal information into the understanding and response generation components has also been developed to support the three proposed applications. We have achieved encouraging experimental results in this initial study, demonstrating the potential of this new research direction.","PeriodicalId":243211,"journal":{"name":"2021 IEEE Spoken Language Technology Workshop (SLT)","volume":"20 2","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-01-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120941682","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multi-Feature Learning with Canonical Correlation Analysis Constraint for Text-Independent Speaker Verification","authors":"Zheng Li, Miao Zhao, Lin Li, Q. Hong","doi":"10.1109/SLT48900.2021.9383541","DOIUrl":"https://doi.org/10.1109/SLT48900.2021.9383541","url":null,"abstract":"In order to improve the performance and robustness of text-independent speaker verification systems, various speaker embedding representation learning algorithms have been developed. Typically, exploring manifold kinds of features to describe speaker-related embeddings is a common approach, such as introducing more acoustic features or different resolution scale features. In this paper, a new multi-feature learning strategy with canonical correlation analysis (CCA) constraint is proposed to learn the instinct speaker embeddings, which maximizes the correlation between two features from the same utterance. Based on the multi-feature learning structure, the CCA constraint layer and the CCA loss are utilized to explore the correlation representation between the two kinds of features and alleviate the redundancy. Therefore, two multi-feature learning strategies are studied, using the pairwise acoustic features, and the pair of short-term and long-term features. Furthermore, we improve the long short-term feature learning structure by replacing the LSTM block with the Bidirectional-GRU (B-GRU) block and introducing more dense layers. The effectiveness of these improvements are shown on the VoxCeleb 1 evaluation set, the noisy Vox-Celeb 1 evaluation set and the SITW evaluation set.","PeriodicalId":243211,"journal":{"name":"2021 IEEE Spoken Language Technology Workshop (SLT)","volume":"69 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-01-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116696420","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Convolution-Based Attention Model With Positional Encoding For Streaming Speech Recognition On Embedded Devices","authors":"Jinhwan Park, Chanwoo Kim, Wonyong Sung","doi":"10.1109/SLT48900.2021.9383583","DOIUrl":"https://doi.org/10.1109/SLT48900.2021.9383583","url":null,"abstract":"On-device automatic speech recognition (ASR) is much more preferred over server-based implementations owing to its low latency and privacy protection. Many server-based ASRs employ recurrent neural networks (RNNs) to exploit their ability to recognize long sequences with a limited number of states; however, they are inefficient for single-stream implementations in embedded devices. In this study, a highly efficient convolutional model-based ASR with monotonic chunkwise attention is developed. Although temporal convolution-based models allow more efficient implementations, they demand a long filter-length to avoid looping or skipping problems. To remedy this problem, we add positional encoding, while shortening the filter length, to a convolution-based ASR encoder. It is demonstrated that the accuracy of the short filter-length convolutional model is significantly improved. In addition, the effect of positional encoding is analyzed by visualizing the attention energy and encoder outputs. The proposed model achieves the word error rate of 11.20% on TED-LIUMv2 for an end-to-end speech recognition task.","PeriodicalId":243211,"journal":{"name":"2021 IEEE Spoken Language Technology Workshop (SLT)","volume":"90 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-01-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114609393","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A New Dataset for Natural Language Understanding of Exercise Logs in a Food and Fitness Spoken Dialogue System","authors":"Maya Epps, J. Uribe, M. Korpusik","doi":"10.1109/SLT48900.2021.9383508","DOIUrl":"https://doi.org/10.1109/SLT48900.2021.9383508","url":null,"abstract":"Health and fitness are becoming increasingly important in the United States, as illustrated by the 70% of adults in the U.S. that are classified as overweight or obese, as well as globally, where obesity nearly tripled since 1975. Prior work used convolutional neural networks (CNNs) to understand a spoken sentence describing one’s meal, in order to expedite the meal-logging process. However, the system lacked a complementary exercise-logging component. We have created a new dataset of 3,000 natural language exercise-logging sentences. Each token was tagged as an Exercise, Feeling, or Other, and mapped to the most relevant exercise, as well as a score of how they felt on a scale from 1 to 10. We demonstrate the following: for intent detection (i.e., logging a meal or exercise), logistic regression achieves over 99% accuracy on a held-out test set; for semantic tagging, contextual embedding models achieve 93% F1 score, outperforming conditional random field models (CRFs); and recurrent neural networks (RNNs) trained on a multiclass classification task successfully map tagged exercise and feeling segments to database matches. By connecting how the user felt while exercising to the food they ate, in the future we may provide personalized and dynamic diet recommendations.","PeriodicalId":243211,"journal":{"name":"2021 IEEE Spoken Language Technology Workshop (SLT)","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-01-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124633370","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Development of CNN-Based Cochlear Implant and Normal Hearing Sound Recognition Models Using Natural and Auralized Environmental Audio","authors":"R. Shekar, Chelzy Belitz, J. Hansen","doi":"10.1109/SLT48900.2021.9383550","DOIUrl":"https://doi.org/10.1109/SLT48900.2021.9383550","url":null,"abstract":"Restoration of auditory function among hearing impaired individuals using Cochlear Implant (CI) technology has contributed significantly towards an improved quality of life. CI users experience greater challenges in recognizing speech effectively in noisy, reverberant, or time-varying diverse environments. Most CI research efforts focus on enhancing speech perception and environmental sound awareness has received little or no attention. This study focuses on a comparative analysis of normal hearing (NH) vs. CI environmental sound recognition using classifiers trained on learned sound representations using a CNN-based sound event model. Sounds experienced by CI listeners are recreated by auralizing electrical stimuli. CCi-MOBILE is used to generate electrical stimuli and Braecker Vocoder is used for auralization. Natural and auralized sound representations are then applied in order to develop NH and CI sound recognition models. Comparative assessment of environmental sound recognition is carried out by analyzing f1-scores and other performance characteristics. Benefits stemming from this research can help CI researchers improve sound recognition performance, develop novel sound processing algorithms, exclusively for environmental sounds, and identify optimal CI electrical stimulation characteristics to enhance sound perception. Among CI users, improvement in environmental sound awareness contributes to improved quality of life.","PeriodicalId":243211,"journal":{"name":"2021 IEEE Spoken Language Technology Workshop (SLT)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-01-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131326767","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards Automatic Route Description Unification in Spoken Dialog Systems","authors":"Yulan Feng, A. Black, M. Eskénazi","doi":"10.1109/SLT48900.2021.9383465","DOIUrl":"https://doi.org/10.1109/SLT48900.2021.9383465","url":null,"abstract":"In telephone-based dialog navigation systems, scheduling and direction information are typically collected from routing APIs in text, and then delivered to users via speech. These systematic directions may be augmented with human descriptions to provide more accurate and personalized routes and cover broader user needs. However, manually collecting, transcribing, correcting, and rewriting human descriptions is time-consuming. Also its inconsistency with systematic directions can be confusing to users when delivered orally. This paper describes the construction of a pipeline to automate the route description unification process which also renders the resulting direction delivery more concise and consistent.","PeriodicalId":243211,"journal":{"name":"2021 IEEE Spoken Language Technology Workshop (SLT)","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-01-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125643886","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Word Similarity Based Label Smoothing in Rnnlm Training for ASR","authors":"Minguang Song, Yunxin Zhao, Shaojun Wang, Mei Han","doi":"10.1109/SLT48900.2021.9383598","DOIUrl":"https://doi.org/10.1109/SLT48900.2021.9383598","url":null,"abstract":"Label smoothing has been shown as an effective regularization approach for deep neural networks. Recently, a context-sensitive label smoothing approach was proposed for training RNNLMs that improved word error rates on speech recognition tasks. Despite the performance gains, its plausible candidate words for label smoothing were confined to n-grams observed in training data. To investigate the potential of label smoothing in model training with insufficient data, in this current work, we propose to utilize the similarity between word embeddings to build a candidate word set for each target word, where by doing so, plausible words outside the n-grams in training data may be found and introduced into candidate word sets for label smoothing. Moreover, we propose to combine the smoothing labels from the n-gram based and the word similarity based methods to improve the generalization capability of RNNLMs. Our proposed approach to RNNLM training has been evaluated for n-best list rescoring on speech recognition tasks of WSJ and AMI, with improved experimental results on word error rates confirming its effectiveness.","PeriodicalId":243211,"journal":{"name":"2021 IEEE Spoken Language Technology Workshop (SLT)","volume":"174 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-01-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116435415","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"IDEA: An Italian Dysarthric Speech Database","authors":"Marco Marini, Mauro Viganò, M. Corbo, M. Zettin, Gloria Simoncini, B. Fattori, Clelia D'Anna, M. Donati, L. Fanucci","doi":"10.1109/SLT48900.2021.9383467","DOIUrl":"https://doi.org/10.1109/SLT48900.2021.9383467","url":null,"abstract":"This paper describes IDEA a database of Italian dysarthric speech produced by 45 speakers affected by 8 different pathologies. Neurologic diagnoses were collected from the subjects’ medical records, while dysarthria assessment was conducted by a speech language pathologist and neurologist. The total number of records is 16794. The speech material consists of 211 isolated common words recorded by a single condenser microphone. The words that refer to an ambient assisted living scenario, have been selected to cover as widely as possible all Italian phonemes.The recordings, supervised by a speech pathologist, were recorded through the RECORDIA software that was developed specifically for this task. It allows multiple recording procedures depending on the patient severity and it includes an electronic record for storing patients’ clinical data. All the recordings in IDEA are annotated with a TextGrid file which defines the boundaries of the speech within the wave file and other types of notes about the record.This paper also includes preliminary experiments on the recorded data to train an automatic speech recognition system from a baseline Kaldi recipe. We trained HMM and DNN models and the results shows 11.75% and 14.99% of WER respectively.","PeriodicalId":243211,"journal":{"name":"2021 IEEE Spoken Language Technology Workshop (SLT)","volume":"330 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-01-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116355447","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Semi-Supervised end-to-end Speech Recognition via Local Prior Matching","authors":"Wei-Ning Hsu, Ann Lee, Gabriel Synnaeve, Awni Y. Hannun","doi":"10.1109/SLT48900.2021.9383552","DOIUrl":"https://doi.org/10.1109/SLT48900.2021.9383552","url":null,"abstract":"For sequence transduction tasks like speech recognition, a strong structured prior model encodes rich information about the target space, implicitly ruling out invalid sequences by assigning them low probability. In this work, we propose local prior matching (LPM), a semi-supervised objective that distills knowledge from a strong prior (e.g. a language model) to provide learning signal to an end-to-end model trained on unlabeled speech. We demonstrate that LPM is simple to implement and superior to existing knowledge distillation techniques under comparable settings. Starting from a baseline trained on 100 hours of labeled speech, with an additional 360 hours of unlabeled data, LPM recovers 54%/82% and 73%/91% of the word error rate on clean and noisy test sets with/without language model rescoring relative to a fully supervised model on the same data.","PeriodicalId":243211,"journal":{"name":"2021 IEEE Spoken Language Technology Workshop (SLT)","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-01-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127164310","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Reed: An Approach Towards Quickly Bootstrapping Multilingual Acoustic Models","authors":"Bipasha Sen, Aditya Agarwal, Mirishkar Sai Ganesh, A. Vuppala","doi":"10.1109/SLT48900.2021.9383457","DOIUrl":"https://doi.org/10.1109/SLT48900.2021.9383457","url":null,"abstract":"Multilingual automatic speech recognition (ASR) system is a single entity capable of transcribing multiple languages sharing a common phone space. Performance of such a system is highly dependent on the compatibility of the languages. State of the art speech recognition systems are built using sequential architectures based on recurrent neural networks (RNN) limiting the computational parallelization in training. This poses a significant challenge in terms of time taken to bootstrap and validate the compatibility of multiple languages for building a robust multilingual system. Complex architectural choices based on self-attention networks are made to improve the parallelization thereby reducing the training time. In this work, we propose Reed, a simple system based on 1D convolutions which uses very short context to improve the training time. To improve the performance of our system, we use raw time-domain speech signals directly as input. This enables the convolutional layers to learn feature representations rather than relying on handcrafted features such as MFCC. We report improvement on training and inference times by atleast a factor of 4× and 7.4× respectively with comparable WERs against standard RNN based baseline systems on SpeechOcean’s multilingual low resource dataset.","PeriodicalId":243211,"journal":{"name":"2021 IEEE Spoken Language Technology Workshop (SLT)","volume":"56 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-01-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125914731","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}