Latest Publications: 2021 IEEE Spoken Language Technology Workshop (SLT)

Two-Stage Augmentation and Adaptive CTC Fusion for Improved Robustness of Multi-Stream End-to-End ASR
2021 IEEE Spoken Language Technology Workshop (SLT) | Pub Date: 2021-01-19 | DOI: 10.1109/SLT48900.2021.9383621
Ruizhi Li, Gregory Sell, H. Hermansky
Abstract: Performance degradation of an Automatic Speech Recognition (ASR) system is commonly observed when the test acoustic condition differs from training. Hence, it is essential to make ASR systems robust against various environmental distortions, such as background noises and reverberations. In a multi-stream paradigm, improving robustness means handling a variety of unseen single-stream conditions and inter-stream dynamics. Previously, a practical two-stage training strategy was proposed within multi-stream end-to-end ASR, where Stage-2 formulates the multi-stream model with features from the Stage-1 Universal Feature Extractor (UFE). In this paper, as an extension, we introduce a two-stage augmentation scheme focusing on mismatch scenarios: Stage-1 Augmentation addresses single-stream input varieties with data augmentation techniques; Stage-2 Time Masking applies temporal masks on UFE features of randomly selected streams to simulate diverse stream combinations. During inference, we also present adaptive Connectionist Temporal Classification (CTC) fusion with the help of hierarchical attention mechanisms. Experiments have been conducted on two datasets, DIRHA and AMI, in a multi-stream scenario. Compared with the previous training strategy, substantial improvements are reported, with relative word error rate reductions of 29.7–59.3% across several unseen stream combinations.
Citations: 1
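The Stage-2 Time Masking step described in the abstract lends itself to a compact illustration. Below is a minimal NumPy sketch, assuming each stream's UFE output arrives as a (T, D) array; the function name, masking width, and selection probability are illustrative assumptions, not the paper's settings.

import numpy as np

def time_mask_streams(streams, mask_prob=0.5, max_mask_frames=40, rng=None):
    """streams: list of (T, D) UFE feature arrays, one per input stream."""
    if rng is None:
        rng = np.random.default_rng()
    masked = []
    for feats in streams:
        feats = feats.copy()
        if rng.random() < mask_prob:  # this stream is selected for masking
            T = feats.shape[0]
            width = int(rng.integers(1, max_mask_frames + 1))
            start = int(rng.integers(0, max(T - width, 1)))
            feats[start:start + width, :] = 0.0  # zero out a temporal span
        masked.append(feats)
    return masked

# Example: two parallel 100-frame streams of 80-dim UFE features
streams = [np.random.randn(100, 80) for _ in range(2)]
augmented = time_mask_streams(streams)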
Noise-Robust Spoken Language Identification Using Language Relevance Factor Based Embedding
2021 IEEE Spoken Language Technology Workshop (SLT) | Pub Date: 2021-01-19 | DOI: 10.1109/SLT48900.2021.9383503
H. Muralikrishna, Shikha Gupta, Dileep Aroor Dinesh, Padmanabhan Rajan
Abstract: State-of-the-art systems for spoken language identification (LID) use an i-vector or an embedding extracted with a deep neural network (DNN) to represent the utterance. These fixed-length representations are obtained without explicitly considering the relevance of individual frame-level feature vectors in deciding the class label. In this paper, we propose a new method to represent the utterance that considers the relevance of the individual frame-level features. The proposed representation can also preserve, to some extent, the locally available LID-specific information in the input features. To better utilize the local-level information in the new representation, we propose a novel segment-level matching-kernel-based support vector machine (SVM) classifier. The relevance-based representation improves the robustness of the LID system to different background noise conditions in the speech. Experiments conducted on speech with different background conditions show that the proposed approach outperforms state-of-the-art approaches on noisy speech and performs similarly to them in the clean-speech condition.
Citations: 3
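To make the relevance idea concrete, here is a hedged NumPy sketch of relevance-weighted pooling over frame-level features: each frame receives a softmax weight before averaging, rather than contributing equally. The scoring vector is a stand-in; how the paper actually derives its language relevance factors is not specified in the abstract.

import numpy as np

def relevance_weighted_embedding(frames, score_vec):
    """frames: (T, D) frame-level features; score_vec: (D,) relevance scorer."""
    logits = frames @ score_vec               # one relevance logit per frame
    weights = np.exp(logits - logits.max())   # numerically stable softmax
    weights /= weights.sum()
    return weights @ frames                   # (D,) relevance-weighted pooling

frames = np.random.randn(200, 64)             # a 200-frame utterance
embedding = relevance_weighted_embedding(frames, np.random.randn(64))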
Acoustic Word Embeddings for Zero-Resource Languages Using Self-Supervised Contrastive Learning and Multilingual Adaptation
2021 IEEE Spoken Language Technology Workshop (SLT) | Pub Date: 2021-01-19 | DOI: 10.1109/SLT48900.2021.9383594
C. Jacobs, Yevgen Matusevych, H. Kamper
Abstract: Acoustic word embeddings (AWEs) are fixed-dimensional representations of variable-length speech segments. For zero-resource languages where labelled data is not available, one AWE approach is to use unsupervised autoencoder-based recurrent models. Another recent approach is to use multilingual transfer: a supervised AWE model is trained on several well-resourced languages and then applied to an unseen zero-resource language. We consider how a recent contrastive learning loss can be used in both the purely unsupervised and multilingual transfer settings. Firstly, we show that terms from an unsupervised term discovery system can be used for contrastive self-supervision, resulting in improvements over previous unsupervised monolingual AWE models. Secondly, we consider how multilingual AWE models can be adapted to a specific zero-resource language using discovered terms. We find that self-supervised contrastive adaptation outperforms adapted multilingual correspondence autoencoder and Siamese AWE models, giving the best overall results in a word discrimination task on six zero-resource languages.
Citations: 18
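The contrastive loss the abstract refers to can be illustrated with a standard softmax-based (InfoNCE-style) objective over word-segment embeddings: instances of the same discovered word type are pulled together while other segments are pushed apart. The temperature and pairing scheme below are illustrative assumptions, not the paper's exact recipe.

import numpy as np

def l2norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def contrastive_loss(anchor, positive, negatives, temperature=0.1):
    """anchor, positive: (D,) embeddings; negatives: (N, D) embeddings."""
    pos_sim = anchor @ positive / temperature
    neg_sim = negatives @ anchor / temperature        # (N,) similarities
    logits = np.concatenate([[pos_sim], neg_sim])
    # cross-entropy with the positive pair as the target class
    return -pos_sim + np.log(np.exp(logits).sum())

anchor, positive = l2norm(np.random.randn(32)), l2norm(np.random.randn(32))
negatives = l2norm(np.random.randn(10, 32))
loss = contrastive_loss(anchor, positive, negatives)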
Incorporating Discriminative DPGMM Posteriorgrams for Low-Resource ASR
2021 IEEE Spoken Language Technology Workshop (SLT) | Pub Date: 2021-01-19 | DOI: 10.1109/SLT48900.2021.9383597
Bin Wu, S. Sakti, Satoshi Nakamura
Abstract: The first step in building an ASR system is to extract proper speech features. Ideal speech features for ASR must have high discriminability between linguistic units while remaining robust to non-linguistic factors such as gender, age, emotion, or noise. The discriminability of various features has been compared in several ZeroSpeech challenges on discovering linguistic units without any transcriptions, in which the posteriorgrams of DPGMM clustering show strong discriminability and achieved several top ABX discrimination scores between phonemes. This paper appends DPGMM posteriorgrams to acoustic features to increase their discriminability and thereby enhance ASR systems. To the best of our knowledge, DPGMM features, which are usually applied to tasks such as spoken term detection and zero-resource tasks, have not previously been applied to large-vocabulary continuous speech recognition (LVCSR). DPGMM clustering can dynamically change the number of Gaussians until each one fits one segmental pattern of the whole speech corpus with the highest probability, such that the linguistic units of different segmental patterns are clearly discriminated. Our experimental results on the WSJ corpora show that our proposal stably improves ASR systems and provides even more improvement for smaller datasets with fewer resources.
Citations: 0
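The feature-level mechanism is simple to sketch: per-frame DPGMM posteriorgrams are concatenated with standard acoustic features before acoustic-model training. The sketch below assumes frame-synchronous features; the DPGMM clustering that actually produces the posteriorgrams is a separate training step not shown here.

import numpy as np

def append_posteriorgrams(acoustic, posteriors):
    """acoustic: (T, D) features, e.g. MFCCs; posteriors: (T, K) posteriorgrams."""
    assert acoustic.shape[0] == posteriors.shape[0], "frame counts must match"
    return np.hstack([acoustic, posteriors])  # (T, D + K) combined features

mfcc = np.random.randn(300, 13)                    # 300 frames of 13-dim MFCCs
post = np.random.dirichlet(np.ones(50), size=300)  # rows sum to 1, like posteriors
features = append_posteriorgrams(mfcc, post)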
Enhancing the Intelligibility of Cleft Lip and Palate Speech Using Cycle-Consistent Adversarial Networks
2021 IEEE Spoken Language Technology Workshop (SLT) | Pub Date: 2021-01-19 | DOI: 10.1109/SLT48900.2021.9383543
Protima Nomo Sudro, Rohan Kumar Das, R. Sinha, S. Prasanna
Abstract: Cleft lip and palate (CLP) refers to a congenital craniofacial condition that causes various speech-related disorders. As a result of structural and functional deformities, affected subjects' speech intelligibility is significantly degraded, limiting the accessibility and usability of speech-controlled devices. To address this problem, it is desirable to improve CLP speech intelligibility, which would also be useful during speech therapy. In this study, the cycle-consistent adversarial network (CycleGAN) method is exploited to improve CLP speech intelligibility. The model is trained on speech data from native Kannada-speaking children. The effectiveness of the proposed approach is measured using automatic speech recognition performance, and a subjective evaluation further confirms the intelligibility improvement of the enhanced speech over the original.
Citations: 3
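At the core of any CycleGAN-based conversion is the cycle-consistency constraint: mapping CLP speech to typical speech and back should reconstruct the input. The sketch below shows only that term, with placeholder generators; the adversarial losses and network architectures of the actual system are omitted.

import numpy as np

def cycle_consistency_loss(x_clp, x_typ, g_clp2typ, g_typ2clp, lam=10.0):
    """x_*: (T, D) spectral feature sequences; g_*: generator functions."""
    recon_clp = g_typ2clp(g_clp2typ(x_clp))   # CLP -> typical -> CLP
    recon_typ = g_clp2typ(g_typ2clp(x_typ))   # typical -> CLP -> typical
    return lam * (np.abs(recon_clp - x_clp).mean()
                  + np.abs(recon_typ - x_typ).mean())

identity = lambda x: x                         # placeholder generators
loss = cycle_consistency_loss(np.random.randn(80, 40),
                              np.random.randn(80, 40), identity, identity)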
Uncertainty-Aware Representations for Spoken Question Answering
2021 IEEE Spoken Language Technology Workshop (SLT) | Pub Date: 2021-01-19 | DOI: 10.1109/SLT48900.2021.9383547
Merve Ünlü Menevşe, E. Arisoy
Abstract: This paper describes a spoken question answering system that utilizes the uncertainty in automatic speech recognition (ASR) to mitigate the effect of ASR errors on question answering. Spoken question answering is typically performed by transcribing spoken content with an ASR system and then applying text-based question answering methods to the ASR transcriptions. Question answering on spoken documents is more challenging than on text documents, since ASR transcriptions can be erroneous, which degrades system performance. In this paper, we propose integrating confusion networks with word confidence scores into an end-to-end neural-network-based question answering system that works on ASR transcriptions. Integration is performed by generating uncertainty-aware embedding representations from the confusion networks. The proposed approach improves the F1 score on a question answering task developed for spoken lectures by providing tighter integration of ASR and question answering.
Citations: 2
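One plausible way to realise "uncertainty-aware embedding representations from confusion networks" is to confidence-weight the embeddings of the competing word hypotheses in each confusion-network slot. The sketch below illustrates that idea; the embedding table and weighting scheme are stand-ins, not necessarily the paper's formulation.

import numpy as np

def slot_embedding(word_ids, confidences, embedding_table):
    """word_ids: competing hypotheses in one slot; confidences: their ASR scores."""
    w = np.asarray(confidences, dtype=float)
    w /= w.sum()                                     # normalise the confidences
    vectors = embedding_table[np.asarray(word_ids)]  # (n_hyp, D) word vectors
    return w @ vectors                               # (D,) expected embedding

table = np.random.randn(1000, 128)                   # toy 1000-word embedding table
emb = slot_embedding([17, 42, 305], [0.6, 0.3, 0.1], table)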
Acoustic Modeling for Multi-Array Conversational Speech Recognition in the CHiME-6 Challenge
2021 IEEE Spoken Language Technology Workshop (SLT) | Pub Date: 2021-01-19 | DOI: 10.1109/SLT48900.2021.9383628
Li Chai, Jun Du, Diyuan Liu, Yanhui Tu, Chin-Hui Lee
Abstract: This paper presents our main acoustic-modeling contributions for multi-array multi-talker speech recognition in the CHiME-6 Challenge, exploring different strategies for acoustic data augmentation and neural network architectures. First, enhanced data from our front-end network preprocessing, together with spectral augmentation, is shown to be effective for improving speech recognition performance. Second, several neural network architectures are explored through different combinations of deep residual networks (ResNet), factorized time-delay neural networks (TDNNF), and residual bidirectional long short-term memory (RBiLSTM). Finally, multiple acoustic models are combined via minimum Bayes risk fusion. Compared with the official baseline acoustic model, the proposed solution achieves a relative word error rate reduction of 19% for the best single ASR system on the evaluation data, one of the main contributions to our top system for the Track 1 tasks of the CHiME-6 Challenge.
Citations: 5
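The spectral augmentation mentioned above is commonly realised as SpecAugment-style frequency masking; the hedged sketch below blanks a random band of mel channels in a log-mel spectrogram. Mask widths and the fill value are illustrative, not the paper's settings.

import numpy as np

def freq_mask(spec, max_width=8, rng=None):
    """spec: (T, F) log-mel spectrogram; returns a frequency-masked copy."""
    if rng is None:
        rng = np.random.default_rng()
    spec = spec.copy()
    num_bins = spec.shape[1]
    width = int(rng.integers(1, max_width + 1))
    start = int(rng.integers(0, num_bins - width + 1))
    spec[:, start:start + width] = spec.mean()   # blank out one frequency band
    return spec

augmented = freq_mask(np.random.randn(500, 80))  # 500 frames, 80 mel bins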
Improving L2 English Rhythm Evaluation with Automatic Sentence Stress Detection
2021 IEEE Spoken Language Technology Workshop (SLT) | Pub Date: 2021-01-19 | DOI: 10.1109/SLT48900.2021.9383455
Binghuai Lin, Liyuan Wang, Hongwei Ding, Xiaoli Feng
Abstract: English is a stress-timed language, in which sentence stress (prosodic stress) plays an important role. It is therefore difficult for Chinese learners, who are used to a syllable-timed rhythm, to learn the rhythm of English [1]. In this paper, we investigate how to improve rhythm evaluation based on sentence stress for Chinese learners of English as a second language (ESL). In particular, we explore rhythm measures that quantify rhythmic differences among second language (L2) learners based on sentence stress. To relieve the dependency on labeled sentence-stress data, we predict sentence stress automatically using a hierarchical network with bidirectional Long Short-Term Memory (BLSTM) layers [2]. We evaluate the proposed method on a corpus of 3,500 sentences recorded by 100 Chinese speakers aged 10 to 20, marked with sentence stress labels and scored by three experts. Experimental results show that the proposed sentence stress measure correlates well with labeled prosody scores (correlation coefficient of −0.73) and that the automatic labeling method achieves results comparable to the method using gold labels.
Citations: 5
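The abstract does not name its rhythm measures, but one plausible stress-based measure is the normalised pairwise variability index (nPVI) computed over the intervals between successive stressed syllables, as sketched below. Treat this as an assumption-laden illustration, not the paper's metric.

import numpy as np

def npvi(intervals):
    """intervals: durations (s) between consecutive stressed syllables."""
    d = np.asarray(intervals, dtype=float)
    terms = [abs(a - b) / ((a + b) / 2.0) for a, b in zip(d[:-1], d[1:])]
    return 100.0 * np.mean(terms)

# Detected stress times (s) -> inter-stress intervals -> rhythm score
stress_times = np.array([0.31, 0.78, 1.40, 1.92, 2.60])
score = npvi(np.diff(stress_times))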
Supervised and Unsupervised Approaches for Controlling Narrow Lexical Focus in Sequence-to-Sequence Speech Synthesis
2021 IEEE Spoken Language Technology Workshop (SLT) | Pub Date: 2021-01-19 | DOI: 10.1109/SLT48900.2021.9383591
Slava Shechtman, Raul Fernandez, D. Haws
Abstract: Although sequence-to-sequence (S2S) architectures have become state-of-the-art in speech synthesis, capable of generating outputs that approach the perceptual quality of natural samples, they are limited by a lack of flexibility when it comes to controlling the output. In this work we present a framework capable of controlling the prosodic output via a set of concise, interpretable, disentangled parameters. We apply this framework to the realization of emphatic lexical focus, proposing a variety of architectures designed to exploit different levels of supervision based on the availability of labeled resources. We evaluate these approaches via listening tests, which demonstrate that we are able to successfully realize controllable focus while maintaining the same or higher naturalness relative to an established baseline, and we explore how the different approaches compare when synthesizing in a target voice with or without labeled data.
Citations: 12
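Purely as an illustration of how a narrow-focus control might be exposed to an S2S synthesiser, the sketch below concatenates a per-token emphasis flag to the encoder inputs. The paper's disentangled-parameter framework is more elaborate; this shows only a possible control interface, under assumed shapes.

import numpy as np

def add_focus_feature(token_embeddings, focus_mask):
    """token_embeddings: (N, D) encoder inputs; focus_mask: (N,) 1.0 = emphasized."""
    focus = np.asarray(focus_mask, dtype=float)[:, None]   # (N, 1) flag column
    return np.hstack([token_embeddings, focus])            # (N, D + 1) inputs

tokens = np.random.randn(12, 256)        # toy phone/word embeddings
mask = np.zeros(12)
mask[4:6] = 1.0                          # request emphasis on tokens 4-5
encoder_inputs = add_focus_feature(tokens, mask)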
Go Beyond Plain Fine-Tuning: Improving Pretrained Models for Social Commonsense
2021 IEEE Spoken Language Technology Workshop (SLT) | Pub Date: 2021-01-19 | DOI: 10.1109/SLT48900.2021.9383453
Ting-Yun Chang, Yang Liu, Karthik Gopalakrishnan, Behnam Hedayatnia, Pei Zhou, Dilek Z. Hakkani-Tür
Abstract: Pretrained language models have recently demonstrated outstanding performance on many NLP tasks. However, their social intelligence, which requires commonsense reasoning about the current situation and the mental states of others, is still developing. Towards improving language models' social intelligence, this study focuses on the Social IQA dataset, a task requiring social and emotional commonsense reasoning. Building on the pretrained RoBERTa and GPT2 models, we propose several architecture variations and extensions, as well as leveraging external commonsense corpora, to optimize the models for Social IQA. Our proposed system achieves results competitive with the top-ranking models on the leaderboard. This work demonstrates the strengths of pretrained language models and provides viable ways to improve their performance on a particular task.
Citations: 0
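For reference, here is a minimal sketch of the plain fine-tuning baseline the paper goes beyond: scoring answer choices with the Hugging Face transformers multiple-choice head on RoBERTa (inference only). The example question is invented, and the paper's architecture variants and external commonsense corpora are not reproduced.

from transformers import RobertaForMultipleChoice, RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaForMultipleChoice.from_pretrained("roberta-base")

context = "Jordan kept Sasha waiting for an hour. How would Sasha feel?"
choices = ["grateful", "annoyed", "indifferent"]
enc = tokenizer([context] * len(choices), choices,
                return_tensors="pt", padding=True, truncation=True)
# The multiple-choice head expects (batch, num_choices, seq_len) inputs
inputs = {k: v.unsqueeze(0) for k, v in enc.items()}
logits = model(**inputs).logits            # (1, num_choices) choice scores
prediction = logits.argmax(dim=-1).item()  # index of the highest-scoring answer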