{"title":"Exploring WavLM on Speech Enhancement","authors":"Hyungchan Song, Sanyuan Chen, Zhuo Chen, Yu Wu, Takuya Yoshioka, M. Tang, Jong Won Shin, Shujie Liu","doi":"10.1109/SLT54892.2023.10023356","DOIUrl":"https://doi.org/10.1109/SLT54892.2023.10023356","url":null,"abstract":"There is a surge in interest in self-supervised learning approaches for end-to-end speech encoding in recent years as they have achieved great success. Especially, WavLM showed state-of-the-art performance on various speech processing tasks. To better understand the efficacy of self-supervised learning models for speech enhancement, in this work, we design and conduct a series of experiments with three resource conditions by combining WavLM and two high-quality speech enhancement systems. Also, We propose a regression-based WavLM training objective and a noise-mixing data configuration to further boost the downstream enhancement performance. The experiments on the DNS challenge dataset and a simulation dataset show that the WavLM benefits the speech enhancement task in terms of both speech quality and speech recognition accuracy, especially for low fine-tuning resources. For the high fine-tuning resource condition, only the word error rate is substantially improved.","PeriodicalId":352002,"journal":{"name":"2022 IEEE Spoken Language Technology Workshop (SLT)","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127275507","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Study on the Integration of Pre-Trained SSL, ASR, LM and SLU Models for Spoken Language Understanding","authors":"Yifan Peng, Siddhant Arora, Yosuke Higuchi, Yushi Ueda, Sujay S. Kumar, Karthik Ganesan, Siddharth Dalmia, Xuankai Chang, Shinji Watanabe","doi":"10.1109/SLT54892.2023.10022399","DOIUrl":"https://doi.org/10.1109/SLT54892.2023.10022399","url":null,"abstract":"Collecting sufficient labeled data for spoken language understanding (SLU) is expensive and time-consuming. Recent studies achieved promising results by using pre-trained models in low-resource scenarios. Inspired by this, we aim to ask: which (if any) pre-training strategies can improve performance across SLU benchmarks? To answer this question, we employ four types of pre-trained models and their combinations for SLU. We leverage self-supervised speech and language models (LM) pre-trained on large quantities of un-paired data to extract strong speech and text representations. We also explore using supervised models pre-trained on larger external automatic speech recognition (ASR) or SLU corpora. We conduct extensive experiments on the SLU Evaluation (SLUE) benchmark and observe self-supervised pre-trained models to be more powerful, with pre-trained LM and speech models being most beneficial for the Sentiment Analysis and Named Entity Recognition task, respectively.11Our code and models will be publicly available as part of the ESPnet-SLU toolkit.","PeriodicalId":352002,"journal":{"name":"2022 IEEE Spoken Language Technology Workshop (SLT)","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114413273","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Remap, Warp and Attend: Non-Parallel Many-to-Many Accent Conversion with Normalizing Flows","authors":"Abdelhamid Ezzerg, Thomas Merritt, K. Yanagisawa, P. Bilinski, Magdalena Proszewska, Kamil Pokora, Renard Korzeniowski, R. Barra-Chicote, Daniel Korzekwa","doi":"10.1109/SLT54892.2023.10022506","DOIUrl":"https://doi.org/10.1109/SLT54892.2023.10022506","url":null,"abstract":"Regional accents of the same language affect not only how words are pronounced (i.e., phonetic content), but also impact prosodic aspects of speech such as speaking rate and intonation. This paper investigates a novel flow-based approach to accent conversion using normalizing flows. The proposed approach revolves around three steps: remapping the phonetic conditioning, to better match the target accent, warping the duration of the converted speech, to better suit the target phonemes, and an attention mechanism that implicitly aligns source and target speech sequences. The proposed remap-warp-attend system enables adaptation of both phonetic and prosodic aspects of speech while allowing for source and converted speech signals to be of different lengths. Objective and subjective evaluations show that the proposed approach significantly outperforms a competitive CopyCat baseline model in terms of similarity to the target accent, naturalness and intelligibility.","PeriodicalId":352002,"journal":{"name":"2022 IEEE Spoken Language Technology Workshop (SLT)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130917776","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Distribution-Based Emotion Recognition in Conversation","authors":"Wen Wu, C. Zhang, P. Woodland","doi":"10.1109/SLT54892.2023.10022800","DOIUrl":"https://doi.org/10.1109/SLT54892.2023.10022800","url":null,"abstract":"Automatic emotion recognition in conversation (ERC) is crucial for emotion-aware conversational artificial intelligence. This paper proposes a distribution-based framework that formulates ERC as a sequence-to-sequence problem for emotion distribution estimation. The inherent ambiguity of emotions and the subjectivity of human perception lead to disagreements in emotion labels, which is handled naturally in our framework from the perspective of uncertainty estimation in emotion distributions. A Bayesian training loss is introduced to improve the uncertainty estimation by conditioning each emotional state on an utterance-specific Dirichlet prior distribution. Experimental results on the IEMOCAP dataset show that ERC outperformed the single-utterance-based system, and the proposed distribution-based ERC methods have not only better classification accuracy, but also show improved uncertainty estimation.","PeriodicalId":352002,"journal":{"name":"2022 IEEE Spoken Language Technology Workshop (SLT)","volume":"56 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128913762","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Streaming, Fast and Accurate on-Device Inverse Text Normalization for Automatic Speech Recognition","authors":"Yashesh Gaur, Nick Kibre, Jian Xue, Kangyuan Shu, Yuhui Wang, Issac Alphonso, Jinyu Li, Y. Gong","doi":"10.1109/SLT54892.2023.10022543","DOIUrl":"https://doi.org/10.1109/SLT54892.2023.10022543","url":null,"abstract":"Automatic Speech Recognition (ASR) systems typically yield output in lexical form. However, humans prefer a written form output. To bridge this gap, ASR systems usually employ Inverse Text Normalization (ITN). In previous works, Weighted Finite State Transducers (WFST) have been employed to do ITN. WFSTs are nicely suited to this task but their size and run-time costs can make deployment on embedded applications challenging. In this paper, we describe the development of an on-device ITN system that is streaming, lightweight & accurate. At the core of our system is a streaming transformer tagger, that tags lexical tokens from ASR. The tag informs which ITN category might be applied, if at all. Following that, we apply an ITN-category-specific WFST, only on the tagged text, to reliably perform the ITN conversion. We show that the proposed ITN solution performs equivalent to strong base-lines, while being significantly smaller in size and retaining customization capabilities.","PeriodicalId":352002,"journal":{"name":"2022 IEEE Spoken Language Technology Workshop (SLT)","volume":"48 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124166924","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Phoneme Segmentation Using Self-Supervised Speech Models","authors":"Luke Strgar, David F. Harwath","doi":"10.1109/SLT54892.2023.10022827","DOIUrl":"https://doi.org/10.1109/SLT54892.2023.10022827","url":null,"abstract":"We apply transfer learning to the task of phoneme segmentation and demonstrate the utility of representations learned in self-supervised pre-training for the task. Our model extends transformer-style encoders with strategically placed convolutions that manipulate features learned in pre-training. Using the TIMIT and Buckeye corpora we train and test the model in the supervised and unsupervised settings. The latter case is accomplished by furnishing a noisy label-set with the predictions of a separate model, it having been trained in an unsupervised fashion. Results indicate our model eclipses previous state-of-the-art performance in both settings and on both datasets. Finally, following observations during published code review and attempts to reproduce past segmentation results, we find a need to disambiguate the definition and implementation of widely-used evaluation metrics. We resolve this ambiguity by delineating two distinct evaluation schemes and describing their nuances.","PeriodicalId":352002,"journal":{"name":"2022 IEEE Spoken Language Technology Workshop (SLT)","volume":"80 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124446312","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SIMD-Size Aware Weight Regularization for Fast Neural Vocoding on CPU","authors":"Hiroki Kanagawa, Yusuke Ijima","doi":"10.1109/SLT54892.2023.10022757","DOIUrl":"https://doi.org/10.1109/SLT54892.2023.10022757","url":null,"abstract":"This paper proposes weight regularization for a faster neural vocoder. Pruning time-consuming DNN modules is a promising way to realize a real-time vocoder on a CPU (e.g. WaveRNN, LPCNet). Regularization that encourages sparsity is also effective in avoiding the quality degradation created by pruning. However, the orders of weight matrices must be contiguous in SIMD size for fast vocoding. To ensure this order, we propose explicit SIMD size aware regularization. Our proposed method reshapes a weight matrix into a tensor so that the weights are aligned by group size in advance, and then computes the group Lasso-like regularization loss. Experiments on 70% sparse subband WaveRNN show that pruning in conventional Lasso and column-wise group Lasso degrades the synthetic speech's naturalness. The vocoder with proposed regularization 1) achieves comparable naturalness to that without pruning and 2) performs meaningfully faster than other conventional vocoders using regularization.","PeriodicalId":352002,"journal":{"name":"2022 IEEE Spoken Language Technology Workshop (SLT)","volume":"70 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126558449","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Unified End-to-End Speech Recognition and Endpointing for Fast and Efficient Speech Systems","authors":"Shaan Bijwadia, Shuo-yiin Chang, Bo Li, Tara N. Sainath, Chaoyang Zhang, Yanzhang He","doi":"10.1109/SLT54892.2023.10022338","DOIUrl":"https://doi.org/10.1109/SLT54892.2023.10022338","url":null,"abstract":"Automatic speech recognition (ASR) systems typically rely on an external endpointer (EP) model to identify speech boundaries. In this work, we propose a method to jointly train the ASR and EP tasks in a single end-to-end (E2E) multitask model, improving EP quality by optionally leveraging information from the ASR audio encoder. We introduce a “switch” connection, which trains the EP to consume either the audio frames directly or low-level latent representations from the ASR model. This results in a single E2E model that can be used during inference to perform frame filtering at low cost, and also make high quality end-of-query (EOQ) predictions based on ongoing ASR computation. We present results on a voice search test set showing that, compared to separate single-task models, this approach reduces median endpoint latency by 120 ms (30.8% reduction), and 90th percentile latency by 170 ms (23.0% reduction), without regressing word error rate. For continuous recognition, WER improves by 10.6% (relative).","PeriodicalId":352002,"journal":{"name":"2022 IEEE Spoken Language Technology Workshop (SLT)","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126752527","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multilingual Speech Emotion Recognition with Multi-Gating Mechanism and Neural Architecture Search","authors":"Zihan Wang, Qianyu Meng, HaiFeng Lan, Xinrui Zhang, Kehao Guo, Akshat Gupta","doi":"10.1109/SLT54892.2023.10022557","DOIUrl":"https://doi.org/10.1109/SLT54892.2023.10022557","url":null,"abstract":"Speech emotion recognition (SER) classifies audio into emotion categories such as Happy, Angry, Fear, Disgust and Neutral. While Speech Emotion Recognition (SER) is a common application for popular languages, it continues to be a problem for low-resourced languages, i.e., languages with no pre-trained speech-to-text recognition models. This paper firstly proposes a language-specific model that extract emotional information from multiple pre-trained speech models, and then designs a multi-domain model that simultaneously performs SER for various languages. Our multi-domain model employs a multi-gating mechanism to generate unique weighted feature combination for each language, and also searches for specific neural network structure for each language through a neural architecture search module. In addition, we introduce a contrastive auxiliary loss to build more separable rep-resentations for audio data. Our experiments show that our model raises the state-of-the-art accuracy by 3% for German and 14.3% for French.","PeriodicalId":352002,"journal":{"name":"2022 IEEE Spoken Language Technology Workshop (SLT)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129491314","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Modular Hybrid Autoregressive Transducer","authors":"Zhong Meng, Tongzhou Chen, Rohit Prabhavalkar, Yu Zhang, Gary Wang, Kartik Audhkhasi, Jesse Emond, Trevor Strohman, B. Ramabhadran, W. R. Huang, Ehsan Variani, Yinghui Huang, P. Moreno","doi":"10.1109/SLT54892.2023.10023194","DOIUrl":"https://doi.org/10.1109/SLT54892.2023.10023194","url":null,"abstract":"Text-only adaptation of a transducer model remains challenging for end-to-end speech recognition since the transducer has no clearly separated acoustic model (AM), language model (LM) or blank model. In this work, we propose a modular hybrid autoregressive transducer (MHAT) that has structurally separated label and blank decoders to predict label and blank distributions, respectively, along with a shared acoustic encoder. The encoder and label decoder outputs are directly projected to AM and internal LM scores and then added to compute label posteriors. We train MHAT with an internal LM loss and a HAT loss to ensure that its internal LM becomes a standalone neural LM that can be effectively adapted to text. Moreover, text adaptation of MHAT fosters a much better LM fusion than internal LM subtraction-based methods. On Google's large-scale production data, a multi-domain MHAT adapted with 100B sentences achieves relative WER reductions of up to 12.4% without LM fusion and 21.5% with LM fusion from 400K-hour trained HAT.","PeriodicalId":352002,"journal":{"name":"2022 IEEE Spoken Language Technology Workshop (SLT)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131334538","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}