2022 IEEE Spoken Language Technology Workshop (SLT): Latest Publications

Exploring WavLM on Speech Enhancement
2022 IEEE Spoken Language Technology Workshop (SLT) Pub Date : 2022-11-18 DOI: 10.1109/SLT54892.2023.10023356
Hyungchan Song, Sanyuan Chen, Zhuo Chen, Yu Wu, Takuya Yoshioka, M. Tang, Jong Won Shin, Shujie Liu
{"title":"Exploring WavLM on Speech Enhancement","authors":"Hyungchan Song, Sanyuan Chen, Zhuo Chen, Yu Wu, Takuya Yoshioka, M. Tang, Jong Won Shin, Shujie Liu","doi":"10.1109/SLT54892.2023.10023356","DOIUrl":"https://doi.org/10.1109/SLT54892.2023.10023356","url":null,"abstract":"There is a surge in interest in self-supervised learning approaches for end-to-end speech encoding in recent years as they have achieved great success. Especially, WavLM showed state-of-the-art performance on various speech processing tasks. To better understand the efficacy of self-supervised learning models for speech enhancement, in this work, we design and conduct a series of experiments with three resource conditions by combining WavLM and two high-quality speech enhancement systems. Also, We propose a regression-based WavLM training objective and a noise-mixing data configuration to further boost the downstream enhancement performance. The experiments on the DNS challenge dataset and a simulation dataset show that the WavLM benefits the speech enhancement task in terms of both speech quality and speech recognition accuracy, especially for low fine-tuning resources. For the high fine-tuning resource condition, only the word error rate is substantially improved.","PeriodicalId":352002,"journal":{"name":"2022 IEEE Spoken Language Technology Workshop (SLT)","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127275507","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 7
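A minimal sketch of the kind of combination the abstract above describes, pairing WavLM features with a mask-based enhancement head. The `microsoft/wavlm-base` checkpoint, the mask-head design and the STFT settings are illustrative assumptions; the paper's two enhancement systems and its regression-based objective are not reproduced here.

```python
import torch
import torch.nn as nn
from transformers import WavLMModel

class WavLMEnhancer(nn.Module):
    """Predict a magnitude mask over STFT bins from WavLM features (sketch)."""
    def __init__(self, n_fft=400, hop=320):
        super().__init__()
        # A 320-sample hop matches WavLM's 20 ms frame rate at 16 kHz,
        # so SSL frames and STFT frames align roughly one-to-one.
        self.n_fft, self.hop = n_fft, hop
        self.wavlm = WavLMModel.from_pretrained("microsoft/wavlm-base")  # assumed checkpoint
        self.mask_head = nn.Sequential(
            nn.Linear(self.wavlm.config.hidden_size, 256), nn.ReLU(),
            nn.Linear(256, n_fft // 2 + 1), nn.Sigmoid())

    def forward(self, noisy_wav):                         # (batch, samples) at 16 kHz
        feats = self.wavlm(noisy_wav).last_hidden_state   # (batch, T, hidden)
        mask = self.mask_head(feats)                      # (batch, T, n_freq)
        spec = torch.stft(noisy_wav, self.n_fft, self.hop,
                          window=torch.hann_window(self.n_fft),
                          return_complex=True)            # (batch, n_freq, T')
        T = min(mask.shape[1], spec.shape[-1])            # SSL/STFT frame counts differ slightly
        return spec[..., :T] * mask[:, :T].transpose(1, 2)  # masked complex spectrogram
```

An inverse STFT of the masked spectrogram would give the enhanced waveform; a regression loss against the clean spectrogram would complete the training loop.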
A Study on the Integration of Pre-Trained SSL, ASR, LM and SLU Models for Spoken Language Understanding
2022 IEEE Spoken Language Technology Workshop (SLT) Pub Date : 2022-11-10 DOI: 10.1109/SLT54892.2023.10022399
Yifan Peng, Siddhant Arora, Yosuke Higuchi, Yushi Ueda, Sujay S. Kumar, Karthik Ganesan, Siddharth Dalmia, Xuankai Chang, Shinji Watanabe
{"title":"A Study on the Integration of Pre-Trained SSL, ASR, LM and SLU Models for Spoken Language Understanding","authors":"Yifan Peng, Siddhant Arora, Yosuke Higuchi, Yushi Ueda, Sujay S. Kumar, Karthik Ganesan, Siddharth Dalmia, Xuankai Chang, Shinji Watanabe","doi":"10.1109/SLT54892.2023.10022399","DOIUrl":"https://doi.org/10.1109/SLT54892.2023.10022399","url":null,"abstract":"Collecting sufficient labeled data for spoken language understanding (SLU) is expensive and time-consuming. Recent studies achieved promising results by using pre-trained models in low-resource scenarios. Inspired by this, we aim to ask: which (if any) pre-training strategies can improve performance across SLU benchmarks? To answer this question, we employ four types of pre-trained models and their combinations for SLU. We leverage self-supervised speech and language models (LM) pre-trained on large quantities of un-paired data to extract strong speech and text representations. We also explore using supervised models pre-trained on larger external automatic speech recognition (ASR) or SLU corpora. We conduct extensive experiments on the SLU Evaluation (SLUE) benchmark and observe self-supervised pre-trained models to be more powerful, with pre-trained LM and speech models being most beneficial for the Sentiment Analysis and Named Entity Recognition task, respectively.11Our code and models will be publicly available as part of the ESPnet-SLU toolkit.","PeriodicalId":352002,"journal":{"name":"2022 IEEE Spoken Language Technology Workshop (SLT)","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114413273","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 13
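A minimal sketch of one combination studied above: self-supervised speech features fused with pre-trained LM features for an SLU classifier. The checkpoint names, the mean/CLS pooling, and fusion by concatenation are illustrative assumptions; the actual ESPnet-SLU setups are richer.

```python
import torch
import torch.nn as nn
from transformers import WavLMModel, BertModel

class FusionSLU(nn.Module):
    def __init__(self, n_classes=3):
        super().__init__()
        self.speech = WavLMModel.from_pretrained("microsoft/wavlm-base")  # SSL speech model
        self.text = BertModel.from_pretrained("bert-base-uncased")        # pre-trained LM
        dim = self.speech.config.hidden_size + self.text.config.hidden_size
        self.classifier = nn.Linear(dim, n_classes)

    def forward(self, wav, input_ids, attention_mask):
        s = self.speech(wav).last_hidden_state.mean(dim=1)   # mean-pooled speech representation
        t = self.text(input_ids, attention_mask=attention_mask).last_hidden_state[:, 0]  # [CLS]
        return self.classifier(torch.cat([s, t], dim=-1))    # e.g. sentiment logits
```

In the paper's pipelines, the text input would come from ASR transcripts rather than gold text.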
Remap, Warp and Attend: Non-Parallel Many-to-Many Accent Conversion with Normalizing Flows
2022 IEEE Spoken Language Technology Workshop (SLT) Pub Date : 2022-11-10 DOI: 10.1109/SLT54892.2023.10022506
Abdelhamid Ezzerg, Thomas Merritt, K. Yanagisawa, P. Bilinski, Magdalena Proszewska, Kamil Pokora, Renard Korzeniowski, R. Barra-Chicote, Daniel Korzekwa
{"title":"Remap, Warp and Attend: Non-Parallel Many-to-Many Accent Conversion with Normalizing Flows","authors":"Abdelhamid Ezzerg, Thomas Merritt, K. Yanagisawa, P. Bilinski, Magdalena Proszewska, Kamil Pokora, Renard Korzeniowski, R. Barra-Chicote, Daniel Korzekwa","doi":"10.1109/SLT54892.2023.10022506","DOIUrl":"https://doi.org/10.1109/SLT54892.2023.10022506","url":null,"abstract":"Regional accents of the same language affect not only how words are pronounced (i.e., phonetic content), but also impact prosodic aspects of speech such as speaking rate and intonation. This paper investigates a novel flow-based approach to accent conversion using normalizing flows. The proposed approach revolves around three steps: remapping the phonetic conditioning, to better match the target accent, warping the duration of the converted speech, to better suit the target phonemes, and an attention mechanism that implicitly aligns source and target speech sequences. The proposed remap-warp-attend system enables adaptation of both phonetic and prosodic aspects of speech while allowing for source and converted speech signals to be of different lengths. Objective and subjective evaluations show that the proposed approach significantly outperforms a competitive CopyCat baseline model in terms of similarity to the target accent, naturalness and intelligibility.","PeriodicalId":352002,"journal":{"name":"2022 IEEE Spoken Language Technology Workshop (SLT)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130917776","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 3
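A minimal sketch of just the "attend" step above: cross-attention lets a target-accent sequence draw from a source sequence of a different length, which is what allows source and converted speech to differ in duration. All dimensions are assumptions; the remap and warp stages and the normalizing flow itself are omitted.

```python
import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=256, num_heads=4, batch_first=True)
target_frames = torch.randn(1, 120, 256)  # queries: warped target-accent timing (120 frames)
source_frames = torch.randn(1, 95, 256)   # keys/values: source-accent features (95 frames)
aligned, weights = attn(target_frames, source_frames, source_frames)
print(aligned.shape)  # torch.Size([1, 120, 256]): source content re-timed to target length
```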
Distribution-Based Emotion Recognition in Conversation
2022 IEEE Spoken Language Technology Workshop (SLT) Pub Date : 2022-11-09 DOI: 10.1109/SLT54892.2023.10022800
Wen Wu, C. Zhang, P. Woodland
{"title":"Distribution-Based Emotion Recognition in Conversation","authors":"Wen Wu, C. Zhang, P. Woodland","doi":"10.1109/SLT54892.2023.10022800","DOIUrl":"https://doi.org/10.1109/SLT54892.2023.10022800","url":null,"abstract":"Automatic emotion recognition in conversation (ERC) is crucial for emotion-aware conversational artificial intelligence. This paper proposes a distribution-based framework that formulates ERC as a sequence-to-sequence problem for emotion distribution estimation. The inherent ambiguity of emotions and the subjectivity of human perception lead to disagreements in emotion labels, which is handled naturally in our framework from the perspective of uncertainty estimation in emotion distributions. A Bayesian training loss is introduced to improve the uncertainty estimation by conditioning each emotional state on an utterance-specific Dirichlet prior distribution. Experimental results on the IEMOCAP dataset show that ERC outperformed the single-utterance-based system, and the proposed distribution-based ERC methods have not only better classification accuracy, but also show improved uncertainty estimation.","PeriodicalId":352002,"journal":{"name":"2022 IEEE Spoken Language Technology Workshop (SLT)","volume":"56 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128913762","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
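A minimal sketch of a Dirichlet-based loss in the spirit of the Bayesian training loss above: the network predicts Dirichlet concentrations and is trained on the expected cross-entropy against the annotators' label distribution, so label disagreement feeds directly into the predicted uncertainty. The utterance-specific prior conditioning from the paper is not reproduced; this is the standard closed form under a predicted Dirichlet.

```python
import torch
import torch.nn.functional as F

def dirichlet_emotion_loss(logits, soft_labels):
    """logits: (batch, n_emotions) mapped to Dirichlet concentrations via softplus.
    soft_labels: (batch, n_emotions) empirical label distribution over annotators,
    which captures inter-annotator disagreement."""
    alpha = F.softplus(logits) + 1.0          # concentrations > 1 keep the density unimodal
    alpha0 = alpha.sum(-1, keepdim=True)      # Dirichlet strength (inverse uncertainty)
    # E_{p ~ Dir(alpha)}[-sum_k y_k log p_k] = sum_k y_k (digamma(alpha0) - digamma(alpha_k))
    loss = (soft_labels * (torch.digamma(alpha0) - torch.digamma(alpha))).sum(-1)
    return loss.mean()

# Example: annotators disagree on the first of two 4-class utterances.
logits = torch.randn(2, 4)
soft = torch.tensor([[2/3, 1/3, 0., 0.], [0., 0., 1., 0.]])
print(dirichlet_emotion_loss(logits, soft))
```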
Streaming, Fast and Accurate on-Device Inverse Text Normalization for Automatic Speech Recognition
2022 IEEE Spoken Language Technology Workshop (SLT) Pub Date : 2022-11-07 DOI: 10.1109/SLT54892.2023.10022543
Yashesh Gaur, Nick Kibre, Jian Xue, Kangyuan Shu, Yuhui Wang, Issac Alphonso, Jinyu Li, Y. Gong
{"title":"Streaming, Fast and Accurate on-Device Inverse Text Normalization for Automatic Speech Recognition","authors":"Yashesh Gaur, Nick Kibre, Jian Xue, Kangyuan Shu, Yuhui Wang, Issac Alphonso, Jinyu Li, Y. Gong","doi":"10.1109/SLT54892.2023.10022543","DOIUrl":"https://doi.org/10.1109/SLT54892.2023.10022543","url":null,"abstract":"Automatic Speech Recognition (ASR) systems typically yield output in lexical form. However, humans prefer a written form output. To bridge this gap, ASR systems usually employ Inverse Text Normalization (ITN). In previous works, Weighted Finite State Transducers (WFST) have been employed to do ITN. WFSTs are nicely suited to this task but their size and run-time costs can make deployment on embedded applications challenging. In this paper, we describe the development of an on-device ITN system that is streaming, lightweight & accurate. At the core of our system is a streaming transformer tagger, that tags lexical tokens from ASR. The tag informs which ITN category might be applied, if at all. Following that, we apply an ITN-category-specific WFST, only on the tagged text, to reliably perform the ITN conversion. We show that the proposed ITN solution performs equivalent to strong base-lines, while being significantly smaller in size and retaining customization capabilities.","PeriodicalId":352002,"journal":{"name":"2022 IEEE Spoken Language Technology Workshop (SLT)","volume":"48 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124166924","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 3
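A minimal sketch of the tag-then-convert control flow described above. The streaming transformer tagger is stubbed with a dictionary rule, and a plain Python function stands in for each category-specific WFST; `toy_tagger` and the converter table are hypothetical names for illustration.

```python
from typing import Callable, Dict, List

NUMBERS = {"one": "1", "two": "2", "three": "3", "four": "4", "five": "5"}

def toy_tagger(tokens: List[str]) -> List[str]:
    # Stand-in for the streaming transformer tagger: label each lexical token
    # with an ITN category, or "O" when no conversion is needed.
    return ["CARDINAL" if t in NUMBERS else "O" for t in tokens]

CONVERTERS: Dict[str, Callable[[str], str]] = {
    # Stand-in for the per-category WFSTs; a real system compiles these as FSTs.
    "CARDINAL": lambda t: NUMBERS[t],
}

def itn(lexical: str) -> str:
    tokens = lexical.split()
    tags = toy_tagger(tokens)
    # Conversion is applied only to tagged spans; untagged text passes through.
    out = [CONVERTERS.get(tag, lambda t: t)(tok) for tok, tag in zip(tokens, tags)]
    return " ".join(out)

print(itn("i have two dogs and three cats"))  # -> "i have 2 dogs and 3 cats"
```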
Phoneme Segmentation Using Self-Supervised Speech Models
2022 IEEE Spoken Language Technology Workshop (SLT) Pub Date : 2022-11-02 DOI: 10.1109/SLT54892.2023.10022827
Luke Strgar, David F. Harwath
{"title":"Phoneme Segmentation Using Self-Supervised Speech Models","authors":"Luke Strgar, David F. Harwath","doi":"10.1109/SLT54892.2023.10022827","DOIUrl":"https://doi.org/10.1109/SLT54892.2023.10022827","url":null,"abstract":"We apply transfer learning to the task of phoneme segmentation and demonstrate the utility of representations learned in self-supervised pre-training for the task. Our model extends transformer-style encoders with strategically placed convolutions that manipulate features learned in pre-training. Using the TIMIT and Buckeye corpora we train and test the model in the supervised and unsupervised settings. The latter case is accomplished by furnishing a noisy label-set with the predictions of a separate model, it having been trained in an unsupervised fashion. Results indicate our model eclipses previous state-of-the-art performance in both settings and on both datasets. Finally, following observations during published code review and attempts to reproduce past segmentation results, we find a need to disambiguate the definition and implementation of widely-used evaluation metrics. We resolve this ambiguity by delineating two distinct evaluation schemes and describing their nuances.","PeriodicalId":352002,"journal":{"name":"2022 IEEE Spoken Language Technology Workshop (SLT)","volume":"80 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124446312","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 3
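A minimal sketch of the architectural idea above: a convolution placed over transformer-style features to produce per-frame boundary logits. The layer sizes, the kernel width, and the assumption of 768-dimensional self-supervised features are illustrative, not the paper's configuration.

```python
import torch
import torch.nn as nn

class BoundaryHead(nn.Module):
    def __init__(self, feat_dim=768):
        super().__init__()
        self.conv = nn.Conv1d(feat_dim, 256, kernel_size=5, padding=2)  # local phone context
        self.enc = nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True)
        self.out = nn.Linear(256, 1)    # per-frame boundary logit

    def forward(self, ssl_feats):       # (batch, T, feat_dim) pre-trained SSL features
        x = self.conv(ssl_feats.transpose(1, 2)).transpose(1, 2)
        x = self.enc(x)
        return self.out(x).squeeze(-1)  # (batch, T): boundary score per frame

head = BoundaryHead()
print(head(torch.randn(2, 100, 768)).shape)  # torch.Size([2, 100])
```

Training would threshold or peak-pick these scores against reference (or pseudo-label) boundaries.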
SIMD-Size Aware Weight Regularization for Fast Neural Vocoding on CPU
2022 IEEE Spoken Language Technology Workshop (SLT) Pub Date : 2022-11-02 DOI: 10.1109/SLT54892.2023.10022757
Hiroki Kanagawa, Yusuke Ijima
{"title":"SIMD-Size Aware Weight Regularization for Fast Neural Vocoding on CPU","authors":"Hiroki Kanagawa, Yusuke Ijima","doi":"10.1109/SLT54892.2023.10022757","DOIUrl":"https://doi.org/10.1109/SLT54892.2023.10022757","url":null,"abstract":"This paper proposes weight regularization for a faster neural vocoder. Pruning time-consuming DNN modules is a promising way to realize a real-time vocoder on a CPU (e.g. WaveRNN, LPCNet). Regularization that encourages sparsity is also effective in avoiding the quality degradation created by pruning. However, the orders of weight matrices must be contiguous in SIMD size for fast vocoding. To ensure this order, we propose explicit SIMD size aware regularization. Our proposed method reshapes a weight matrix into a tensor so that the weights are aligned by group size in advance, and then computes the group Lasso-like regularization loss. Experiments on 70% sparse subband WaveRNN show that pruning in conventional Lasso and column-wise group Lasso degrades the synthetic speech's naturalness. The vocoder with proposed regularization 1) achieves comparable naturalness to that without pruning and 2) performs meaningfully faster than other conventional vocoders using regularization.","PeriodicalId":352002,"journal":{"name":"2022 IEEE Spoken Language Technology Workshop (SLT)","volume":"70 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126558449","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
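A minimal sketch of the regularizer as the abstract describes it: reshape a weight matrix so weights are grouped in SIMD-sized contiguous blocks, then apply an L2-within, L1-across (group-Lasso-like) penalty so whole blocks are driven to zero together. The group size of 8 (e.g., fp32 under AVX2) and the penalty weight are assumptions.

```python
import torch

def simd_group_lasso(weight: torch.Tensor, group_size: int = 8) -> torch.Tensor:
    """weight: (out_features, in_features), with in_features divisible by group_size."""
    out_f, in_f = weight.shape
    groups = weight.reshape(out_f, in_f // group_size, group_size)
    # L2 norm within each contiguous block, summed (L1) across blocks:
    # zeroed blocks stay contiguous in memory, so SIMD kernels can skip them whole.
    return groups.norm(dim=-1).sum()

w = torch.randn(32, 64, requires_grad=True)
task_loss = w.pow(2).mean()                    # placeholder for the vocoder loss
total = task_loss + 1e-3 * simd_group_lasso(w) # add the sparsity penalty
total.backward()
```

After training, blocks whose norms fall below a threshold are pruned as units, preserving the SIMD-friendly layout.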
Unified End-to-End Speech Recognition and Endpointing for Fast and Efficient Speech Systems
2022 IEEE Spoken Language Technology Workshop (SLT) Pub Date : 2022-11-01 DOI: 10.1109/SLT54892.2023.10022338
Shaan Bijwadia, Shuo-yiin Chang, Bo Li, Tara N. Sainath, Chaoyang Zhang, Yanzhang He
{"title":"Unified End-to-End Speech Recognition and Endpointing for Fast and Efficient Speech Systems","authors":"Shaan Bijwadia, Shuo-yiin Chang, Bo Li, Tara N. Sainath, Chaoyang Zhang, Yanzhang He","doi":"10.1109/SLT54892.2023.10022338","DOIUrl":"https://doi.org/10.1109/SLT54892.2023.10022338","url":null,"abstract":"Automatic speech recognition (ASR) systems typically rely on an external endpointer (EP) model to identify speech boundaries. In this work, we propose a method to jointly train the ASR and EP tasks in a single end-to-end (E2E) multitask model, improving EP quality by optionally leveraging information from the ASR audio encoder. We introduce a “switch” connection, which trains the EP to consume either the audio frames directly or low-level latent representations from the ASR model. This results in a single E2E model that can be used during inference to perform frame filtering at low cost, and also make high quality end-of-query (EOQ) predictions based on ongoing ASR computation. We present results on a voice search test set showing that, compared to separate single-task models, this approach reduces median endpoint latency by 120 ms (30.8% reduction), and 90th percentile latency by 170 ms (23.0% reduction), without regressing word error rate. For continuous recognition, WER improves by 10.6% (relative).","PeriodicalId":352002,"journal":{"name":"2022 IEEE Spoken Language Technology Workshop (SLT)","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126752527","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 6
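A minimal sketch of the "switch" connection above: the endpointer consumes either audio frames directly (for cheap frame filtering) or low-level ASR encoder latents (for EOQ decisions that piggyback on ongoing ASR computation). Dimensions and the tiny EP head are assumptions.

```python
import torch
import torch.nn as nn

class SwitchedEndpointer(nn.Module):
    def __init__(self, feat_dim=80, latent_dim=256):
        super().__init__()
        self.from_audio = nn.Linear(feat_dim, 64)    # path 1: raw log-mel frames
        self.from_latent = nn.Linear(latent_dim, 64) # path 2: low-level ASR latents
        self.ep = nn.Sequential(nn.ReLU(), nn.Linear(64, 2))  # speech vs. end-of-query

    def forward(self, audio_frames, asr_latents=None, use_latents=False):
        # The switch: share ASR computation when it is running anyway,
        # otherwise run cheaply on audio frames alone.
        x = self.from_latent(asr_latents) if use_latents else self.from_audio(audio_frames)
        return self.ep(x)  # per-frame EP logits

ep = SwitchedEndpointer()
print(ep(torch.randn(1, 50, 80)).shape)                            # audio-frame path
print(ep(None, torch.randn(1, 50, 256), use_latents=True).shape)   # ASR-latent path
```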
Multilingual Speech Emotion Recognition with Multi-Gating Mechanism and Neural Architecture Search
2022 IEEE Spoken Language Technology Workshop (SLT) Pub Date : 2022-10-31 DOI: 10.1109/SLT54892.2023.10022557
Zihan Wang, Qianyu Meng, HaiFeng Lan, Xinrui Zhang, Kehao Guo, Akshat Gupta
{"title":"Multilingual Speech Emotion Recognition with Multi-Gating Mechanism and Neural Architecture Search","authors":"Zihan Wang, Qianyu Meng, HaiFeng Lan, Xinrui Zhang, Kehao Guo, Akshat Gupta","doi":"10.1109/SLT54892.2023.10022557","DOIUrl":"https://doi.org/10.1109/SLT54892.2023.10022557","url":null,"abstract":"Speech emotion recognition (SER) classifies audio into emotion categories such as Happy, Angry, Fear, Disgust and Neutral. While Speech Emotion Recognition (SER) is a common application for popular languages, it continues to be a problem for low-resourced languages, i.e., languages with no pre-trained speech-to-text recognition models. This paper firstly proposes a language-specific model that extract emotional information from multiple pre-trained speech models, and then designs a multi-domain model that simultaneously performs SER for various languages. Our multi-domain model employs a multi-gating mechanism to generate unique weighted feature combination for each language, and also searches for specific neural network structure for each language through a neural architecture search module. In addition, we introduce a contrastive auxiliary loss to build more separable rep-resentations for audio data. Our experiments show that our model raises the state-of-the-art accuracy by 3% for German and 14.3% for French.","PeriodicalId":352002,"journal":{"name":"2022 IEEE Spoken Language Technology Workshop (SLT)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129491314","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
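A minimal sketch of a multi-gating mechanism consistent with the description above: a learned per-language gate mixes features from several pre-trained speech models into one language-specific combination. The number of expert models and the feature dimension are assumptions; the NAS module and contrastive loss are omitted.

```python
import torch
import torch.nn as nn

class MultiGate(nn.Module):
    def __init__(self, n_experts=3, feat_dim=768, n_langs=4):
        super().__init__()
        # One gating vector per language over the expert features.
        self.gates = nn.Parameter(torch.zeros(n_langs, n_experts))

    def forward(self, expert_feats, lang_id):
        # expert_feats: (batch, n_experts, feat_dim) from frozen pre-trained models
        w = torch.softmax(self.gates[lang_id], dim=-1)      # (batch, n_experts)
        return (w.unsqueeze(-1) * expert_feats).sum(dim=1)  # language-specific mix

gate = MultiGate()
feats = torch.randn(2, 3, 768)                   # two utterances, three expert models
print(gate(feats, torch.tensor([0, 2])).shape)   # torch.Size([2, 768])
```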
Modular Hybrid Autoregressive Transducer
2022 IEEE Spoken Language Technology Workshop (SLT) Pub Date : 2022-10-31 DOI: 10.1109/SLT54892.2023.10023194
Zhong Meng, Tongzhou Chen, Rohit Prabhavalkar, Yu Zhang, Gary Wang, Kartik Audhkhasi, Jesse Emond, Trevor Strohman, B. Ramabhadran, W. R. Huang, Ehsan Variani, Yinghui Huang, P. Moreno
{"title":"Modular Hybrid Autoregressive Transducer","authors":"Zhong Meng, Tongzhou Chen, Rohit Prabhavalkar, Yu Zhang, Gary Wang, Kartik Audhkhasi, Jesse Emond, Trevor Strohman, B. Ramabhadran, W. R. Huang, Ehsan Variani, Yinghui Huang, P. Moreno","doi":"10.1109/SLT54892.2023.10023194","DOIUrl":"https://doi.org/10.1109/SLT54892.2023.10023194","url":null,"abstract":"Text-only adaptation of a transducer model remains challenging for end-to-end speech recognition since the transducer has no clearly separated acoustic model (AM), language model (LM) or blank model. In this work, we propose a modular hybrid autoregressive transducer (MHAT) that has structurally separated label and blank decoders to predict label and blank distributions, respectively, along with a shared acoustic encoder. The encoder and label decoder outputs are directly projected to AM and internal LM scores and then added to compute label posteriors. We train MHAT with an internal LM loss and a HAT loss to ensure that its internal LM becomes a standalone neural LM that can be effectively adapted to text. Moreover, text adaptation of MHAT fosters a much better LM fusion than internal LM subtraction-based methods. On Google's large-scale production data, a multi-domain MHAT adapted with 100B sentences achieves relative WER reductions of up to 12.4% without LM fusion and 21.5% with LM fusion from 400K-hour trained HAT.","PeriodicalId":352002,"journal":{"name":"2022 IEEE Spoken Language Technology Workshop (SLT)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131334538","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 13
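A minimal sketch of the score factorization described above: encoder and label-decoder outputs are projected to AM-like and internal-LM scores and summed for label posteriors, with a separate blank decoder. A real MHAT is a full transducer trained with HAT and internal-LM losses over the (time, label) lattice; this fragment only shows why the internal LM stays a standalone, text-adaptable module. All dimensions are assumptions.

```python
import torch
import torch.nn as nn

class MHATJoint(nn.Module):
    def __init__(self, enc_dim=512, dec_dim=512, vocab=1000):
        super().__init__()
        self.am_proj = nn.Linear(enc_dim, vocab)      # AM-like scores from the encoder
        self.ilm_proj = nn.Linear(dec_dim, vocab)     # internal-LM scores from the label decoder
        self.blank = nn.Linear(enc_dim + dec_dim, 1)  # blank scored via its own decoder path

    def forward(self, enc, label_dec, blank_dec):
        # Label posteriors come from the *sum* of AM and internal-LM scores, so the
        # internal LM is a separable module that text-only data can adapt.
        label_logp = torch.log_softmax(self.am_proj(enc) + self.ilm_proj(label_dec), dim=-1)
        blank_logit = self.blank(torch.cat([enc, blank_dec], dim=-1))
        return label_logp, blank_logit

joint = MHATJoint()
enc, lab, blk = torch.randn(2, 512), torch.randn(2, 512), torch.randn(2, 512)
label_logp, blank_logit = joint(enc, lab, blk)
print(label_logp.shape, blank_logit.shape)  # torch.Size([2, 1000]) torch.Size([2, 1])
```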