2021 IEEE Spoken Language Technology Workshop (SLT): Latest Publications

SLT 2021 Author Index
2021 IEEE Spoken Language Technology Workshop (SLT) Pub Date: 2021-01-19 DOI: 10.1109/slt48900.2021.9383561
Citations: 0
Protoda: Efficient Transfer Learning for Few-Shot Intent Classification
2021 IEEE Spoken Language Technology Workshop (SLT) Pub Date: 2021-01-19 DOI: 10.1109/SLT48900.2021.9383495
Manoj Kumar, Varun Kumar, Hadrien Glaude, Cyprien Delichy, Aman Alok, Rahul Gupta
Abstract: Practical sequence classification tasks in natural language processing often suffer from low training data availability for target classes. Recent work on mitigating this problem has focused on transfer learning using embeddings pre-trained on often unrelated tasks, for instance, language modeling. We adopt an alternative approach: transfer learning on an ensemble of related tasks using prototypical networks under the meta-learning paradigm. Using intent classification as a case study, we demonstrate that increasing variability in training tasks can significantly improve classification performance. Further, we apply data augmentation in conjunction with meta-learning to reduce sampling bias. We use a conditional generator for data augmentation that is trained directly with the meta-learning objective, simultaneously with the prototypical networks, ensuring that the augmentation is customized to the task. We explore augmentation in the sentence embedding space as well as the prototypical embedding space. Combining meta-learning with augmentation provides up to 6.49% and 8.53% relative F1-score improvements over the best-performing systems in 5-shot and 10-shot learning, respectively.
Citations: 12
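The prototypical-network classification step at the heart of this approach is compact enough to sketch directly: each class prototype is the mean of its support-set embeddings, and a query is assigned to the nearest prototype. The embeddings, labels, and 2-D dimensionality below are invented toy values; the paper's actual model learns the embedding space and additionally trains a conditional generator for augmentation.

```python
import math

def prototypes(support):
    """Mean embedding per class, computed from few-shot support examples."""
    return {label: [sum(dim) / len(vecs) for dim in zip(*vecs)]
            for label, vecs in support.items()}

def classify(query, protos):
    """Assign the query to the class whose prototype is nearest (Euclidean)."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(protos, key=lambda label: dist(query, protos[label]))

# Toy 2-shot support set with 2-D "sentence embeddings" (hypothetical values).
support = {
    "play_music": [[0.9, 0.1], [1.1, -0.1]],
    "set_alarm":  [[-1.0, 0.8], [-0.8, 1.2]],
}
protos = prototypes(support)
print(classify([1.0, 0.0], protos))  # "play_music"
```

Augmentation in this framing simply adds generated vectors to the per-class lists before the prototype means are taken.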
Transformer Based Deliberation for Two-Pass Speech Recognition
2021 IEEE Spoken Language Technology Workshop (SLT) Pub Date: 2021-01-19 DOI: 10.1109/SLT48900.2021.9383497
Ke Hu, Ruoming Pang, Tara N. Sainath, Trevor Strohman
Abstract: Interactive speech recognition systems must generate words quickly while also producing accurate results. Two-pass models excel at these requirements by employing a first-pass decoder that quickly emits words and a second-pass decoder that requires more context but is more accurate. Previous work has established that a deliberation network can be an effective second-pass model. The model attends to two kinds of inputs at once: encoded audio frames and the hypothesis text from the first-pass model. In this work, we explore using transformer layers instead of long short-term memory (LSTM) layers for deliberation rescoring. In the transformer layers, we generalize the "encoder-decoder" attention to attend to both encoded audio and first-pass text hypotheses. The output context vectors are then combined by a merger layer. Compared to LSTM-based deliberation, our best transformer deliberation achieves a 7% relative word error rate improvement along with a 38% reduction in computation. We also compare against non-deliberation transformer rescoring and find a 9% relative improvement.
Citations: 29
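The generalized "encoder-decoder" attention described in the abstract (attending separately to encoded audio and to first-pass text, then merging the two context vectors) can be sketched as follows. This is a toy single-query, single-head dot-product attention with concatenation standing in for the merger layer; the real model uses learned projections, multiple heads, and stacked layers.

```python
import math

def attend(query, keys):
    """Single-head dot-product attention: a softmax-weighted sum of keys."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    m = max(scores)  # subtract max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    return [sum((w / z) * key[i] for w, key in zip(weights, keys))
            for i in range(len(query))]

def deliberate(query, audio_frames, hyp_tokens):
    """Attend to encoded audio and first-pass text separately, then merge
    the two context vectors (here: simple concatenation)."""
    return attend(query, audio_frames) + attend(query, hyp_tokens)

# Toy 2-D encodings for two audio frames and two hypothesis tokens.
ctx = deliberate([1.0, 0.0],
                 audio_frames=[[0.5, 0.5], [1.0, 0.0]],
                 hyp_tokens=[[0.0, 1.0], [0.2, 0.8]])
print(len(ctx))  # 4: audio context (2 dims) concatenated with text context (2 dims)
```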
Improving Speech Recognition Accuracy of Local POI Using Geographical Models
2021 IEEE Spoken Language Technology Workshop (SLT) Pub Date: 2021-01-19 DOI: 10.1109/SLT48900.2021.9383538
Songjun Cao, Yike Zhang, Xiaobing Feng, Long Ma
Abstract: Voice search for points of interest (POI) is becoming increasingly popular. However, speech recognition for local POI names remains a challenge due to the multi-dialect and long-tailed distribution of POI names. This paper improves speech recognition accuracy for local POI from two directions. First, a geographic acoustic model (Geo-AM) is proposed; it addresses the multi-dialect problem using dialect-specific input features and dialect-specific top layers. Second, a group of geo-specific language models (Geo-LMs) is integrated into the speech recognition system to improve recognition accuracy for long-tailed and homophone POI names. During decoding, a specific Geo-LM is selected on demand according to the user's geographic location. Experiments show that the proposed Geo-AM achieves a 6.5%-10.1% relative character error rate (CER) reduction on an accent test set, and that the Geo-AM and Geo-LMs together achieve over 18.7% relative CER reduction on a voice search task for Tencent Map.
Citations: 3
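The on-demand Geo-LM selection can be illustrated as a nearest-region lookup keyed on the user's coordinates. The region names and centre coordinates below are hypothetical, and the production system presumably uses its own geographic index rather than a linear scan; this only shows the decoding-time selection idea.

```python
def pick_geo_lm(user_location, geo_lms):
    """Select the geo-specific LM whose region centre is closest to the user
    (squared Euclidean distance on (lat, lon) pairs, toy approximation)."""
    def sq_dist(a, b):
        return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2
    return min(geo_lms, key=lambda region: sq_dist(user_location, geo_lms[region]))

# Hypothetical region centres (latitude, longitude) for illustration only.
geo_lms = {"shenzhen": (22.54, 114.06), "beijing": (39.90, 116.41)}
print(pick_geo_lm((22.6, 113.9), geo_lms))  # "shenzhen"
```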
Investigations on audiovisual emotion recognition in noisy conditions
2021 IEEE Spoken Language Technology Workshop (SLT) Pub Date: 2021-01-19 DOI: 10.1109/SLT48900.2021.9383588
M. Neumann, Ngoc Thang Vu
Abstract: In this paper we explore audiovisual emotion recognition under noisy acoustic conditions with a focus on speech features. We attempt to answer two research questions: (i) how does speech emotion recognition perform on noisy data, and (ii) to what extent does a multimodal approach improve accuracy and compensate for potential performance degradation at different noise levels? We present an analytical investigation on two emotion datasets with superimposed noise at different signal-to-noise ratios, comparing three types of acoustic features. Visual features are incorporated with a hybrid fusion approach: the first neural network layers are separate modality-specific ones, followed by at least one shared layer before the final prediction. The results show a significant performance decrease when a model trained on clean audio is applied to noisy data, and that the addition of visual features alleviates this effect.
Citations: 4
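The hybrid fusion layout described above (separate modality-specific layers, then at least one shared layer before the prediction) can be sketched with toy fixed weights. A real model would learn all weights and operate on much larger feature vectors; the dimensions and values here are invented for illustration.

```python
def dense(x, w, b):
    """Minimal fully connected layer with ReLU activation."""
    return [max(0.0, sum(wi * xi for wi, xi in zip(row, x)) + bi)
            for row, bi in zip(w, b)]

def hybrid_fusion(audio, video):
    """Modality-specific layers first, then one shared layer on the
    concatenated representations (toy weights, single output unit)."""
    a = dense(audio, [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])  # audio branch
    v = dense(video, [[0.5, 0.5]], [0.0])                   # video branch
    fused = a + v                                           # concatenate
    return dense(fused, [[1.0, 1.0, 1.0]], [0.0])           # shared layer

print(hybrid_fusion([0.2, 0.4], [1.0, 1.0]))
```

When the audio branch degrades under noise, the shared layer can still draw on the video branch, which is the compensation effect the paper measures.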
Cross-Demographic Portability of Deep NLP-Based Depression Models
2021 IEEE Spoken Language Technology Workshop (SLT) Pub Date: 2021-01-19 DOI: 10.1109/SLT48900.2021.9383609
T. Rutowski, Elizabeth Shriberg, A. Harati, Yang Lu, R. Oliveira, P. Chlebek
Abstract: Deep learning models are rapidly gaining interest for real-world applications in behavioral health. An important gap in the current literature is how well such models generalize across different populations. We study Natural Language Processing (NLP) based models to explore portability across two corpora highly mismatched in age. The first and larger corpus contains younger speakers and is used to train an NLP model to predict depression. When testing on unseen speakers from the same age distribution, this model performs at AUC = 0.82. We then test this model on the second corpus, which comprises seniors from a retirement community. Despite the large demographic differences between the two corpora, we see only modest degradation on the senior-corpus data, achieving AUC = 0.76. Interestingly, in the senior population we find AUC = 0.81 for the subset of patients whose health state is consistent over time. Implications for demographic portability of speech-based applications are discussed.
Citations: 4
Personalizing Speech Start Point and End Point Detection in ASR Systems from Speaker Embeddings
2021 IEEE Spoken Language Technology Workshop (SLT) Pub Date: 2021-01-19 DOI: 10.1109/SLT48900.2021.9383516
Aditya Jayasimha, Periyasamy Paramasivam
Abstract: Start point detection (SPD) and end point detection (EPD) in automatic speech recognition (ASR) systems are the tasks of detecting the times at which the user starts and stops speaking, respectively. They are crucial problems in ASR, as inaccurate SPD and/or EPD leads to poor ASR performance and a bad user experience. The main challenge is accurate detection in noisy environments, especially when speech noise is significant in the background: current approaches tend to fail to distinguish between the speech of the real user and speech in the background. In this work, we aim to improve SPD and EPD in a multi-speaker environment. We propose a novel approach that personalizes SPD and EPD to a desired user and helps improve ASR quality and latency. We combine user-specific information (i-vectors) with traditional speech features (log-mel) and build a convolutional, long short-term memory, deep neural network (CLDNN) model to achieve personalized SPD and EPD. The proposed approach achieves relative improvements of 46.53% and 11.31% in SPD accuracy, and 27.87% and 5.31% in EPD accuracy, at SNRs of 0 and 5 dB respectively, over standard non-personalized models.
Citations: 4
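The personalization mechanism (combining i-vectors with log-mel features) amounts, in its simplest form, to appending the speaker embedding to every acoustic frame before it enters the detector, so the model can condition on who the target speaker is. The feature dimensions below are toy values; real log-mel frames and i-vectors are much higher-dimensional, and the paper's CLDNN sits on top of this input.

```python
def personalize_frames(logmel_frames, i_vector):
    """Append the target speaker's i-vector to every acoustic frame so a
    start/end-point detector can condition on the desired speaker."""
    return [frame + i_vector for frame in logmel_frames]

frames = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]  # toy 3-dim log-mel frames
ivec = [0.9, -0.1]                            # toy 2-dim speaker embedding
augmented = personalize_frames(frames, ivec)
print(len(augmented[0]))  # 5 = 3 log-mel dims + 2 i-vector dims
```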
Innovative BERT-Based Reranking Language Models for Speech Recognition
2021 IEEE Spoken Language Technology Workshop (SLT) Pub Date: 2021-01-19 DOI: 10.1109/SLT48900.2021.9383557
Shih-Hsuan Chiu, Berlin Chen
Abstract: Bidirectional Encoder Representations from Transformers (BERT) has achieved impressive success on many natural language processing (NLP) tasks such as question answering and language understanding, due mainly to its effective pre-training-then-fine-tuning paradigm as well as its strong local contextual modeling ability. In view of this, this paper presents a novel instantiation of BERT-based contextualized language models (LMs) for reranking the N-best hypotheses produced by automatic speech recognition (ASR). We frame N-best hypothesis reranking with BERT as a prediction problem, which aims to predict the oracle hypothesis, the one with the lowest word error rate (WER) among the N-best hypotheses (denoted PBERT). We also explore capitalizing on task-specific global topic information in an unsupervised manner to assist PBERT in N-best hypothesis reranking (denoted TPBERT). Extensive experiments on the AMI benchmark corpus demonstrate the effectiveness and feasibility of our methods in comparison with conventional autoregressive models such as the recurrent neural network (RNN) and a recently proposed method that employed BERT to compute pseudo-log-likelihood (PLL) scores for N-best hypothesis reranking.
Citations: 35
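Framing reranking as oracle-hypothesis prediction reduces, at inference time, to scoring each N-best entry and taking the argmax. The scorer below is a deliberate stand-in (PBERT would instead score each hypothesis with a fine-tuned BERT head, and TPBERT would add topic features); only the reranking scaffold is shown.

```python
def rerank(hypotheses, score_fn):
    """Return the hypothesis the model scores as most likely to be the
    oracle (lowest-WER) candidate in an ASR N-best list."""
    return max(hypotheses, key=score_fn)

# Stand-in scorer for illustration: prefer shorter hypotheses. A real
# system would use the fine-tuned BERT model's per-hypothesis score.
def toy_score(hyp):
    return -len(hyp.split())

nbest = ["the cat sat on the mat", "the cat sat on a mat mat"]
print(rerank(nbest, toy_score))  # "the cat sat on the mat"
```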
Unsupervised Acoustic-to-Articulatory Inversion Neural Network Learning Based on Deterministic Policy Gradient
2021 IEEE Spoken Language Technology Workshop (SLT) Pub Date: 2021-01-19 DOI: 10.1109/SLT48900.2021.9383554
Hayato Shibata, Mingxin Zhang, T. Shinozaki
Abstract: This paper presents an unsupervised learning method for deep neural networks that perform acoustic-to-articulatory inversion on arbitrary utterances. Conventional unsupervised acoustic-to-articulatory inversion methods are based on the analysis-by-synthesis approach and non-linear optimization algorithms. One limitation is that they require time-consuming iterative optimization to obtain articulatory parameters for a given target speech segment. Neural networks, after learning the acoustic-articulatory relationship, can obtain these parameters without iterative optimization; however, conventional methods need supervised learning with paired acoustic and articulatory samples. We propose a hybrid auto-encoder-based unsupervised learning framework for acoustic-to-articulatory inversion neural networks that can capture context information. The essential point of the framework is making the training effective. We investigate several reinforcement learning algorithms and show the usefulness of the deterministic policy gradient. Experimental results demonstrate that the proposed method can infer articulatory parameters not only for training-set segments but also for unseen utterances. Averaged reconstruction errors on open test samples are similar to, or even lower than, those of the conventional method that directly optimizes the articulatory parameters in a closed condition.
Citations: 5
Embedding Aggregation for Far-Field Speaker Verification with Distributed Microphone Arrays
2021 IEEE Spoken Language Technology Workshop (SLT) Pub Date: 2021-01-19 DOI: 10.1109/SLT48900.2021.9383501
Danwei Cai, Ming Li
Abstract: With the successful application of deep speaker embedding networks, the performance of speaker verification systems has significantly improved under clean and close-talking settings; however, performance remains unsatisfactory in noisy, far-field environments. This study aims to improve far-field speaker verification with distributed microphone arrays in the smart-home scenario. The proposed learning framework consists of two modules: a deep speaker embedding module and an aggregation module. The former extracts a speaker embedding for each recording. The latter, based on either average pooling or attentive pooling, aggregates the speaker embeddings and learns a unified representation for all recordings captured by the distributed microphone arrays. The two modules are trained end to end. To evaluate this framework, we conduct experiments on the real text-dependent far-field dataset HI-MIA. Results show that our framework outperforms naive average aggregation by 20% in terms of equal error rate (EER) with six distributed microphone arrays. We also find that attention-based aggregation favors high-quality recordings and discounts low-quality ones.
Citations: 3
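The attentive-pooling aggregation can be sketched as a softmax-weighted average of per-array embeddings, which is precisely what lets the model emphasize high-quality recordings. The embeddings and quality scores below are toy values; in the actual framework the scores are produced by a learned attention layer trained end to end with the embedding network.

```python
import math

def attentive_pool(embeddings, scores):
    """Aggregate per-array speaker embeddings with softmax attention
    weights, so higher-scored recordings dominate the pooled result."""
    m = max(scores)  # subtract max for numerical stability
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    dim = len(embeddings[0])
    return [sum((wi / z) * e[i] for wi, e in zip(w, embeddings))
            for i in range(dim)]

# Two microphone arrays; the second (cleaner) recording gets a higher score.
embs = [[1.0, 0.0], [0.0, 1.0]]
pooled = attentive_pool(embs, scores=[0.0, 2.0])
print(pooled[1] > pooled[0])  # True: the high-quality recording dominates
```

With equal scores this reduces to the average-pooling baseline, which makes the two aggregation variants in the paper directly comparable.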