Latest Articles from the 2022 IEEE Spoken Language Technology Workshop (SLT)

Mixture of Domain Experts for Language Understanding: an Analysis of Modularity, Task Performance, and Memory Tradeoffs
2022 IEEE Spoken Language Technology Workshop (SLT) | Pub Date: 2023-01-09 | DOI: 10.1109/SLT54892.2023.10022866
Benjamin Kleiner, Jack G. M. FitzGerald, Haidar Khan, Gokhan Tur
Abstract: One of the limitations of large-scale machine learning models is that they are difficult to adjust after deployment without significant re-training costs. In this paper, we focus on NLU and the need of virtual assistant systems to update themselves continually over time to support new functionality. Specifically, we consider the tasks of intent classification (IC) and slot filling (SF), which are fundamental to processing user interactions with virtual assistants. We studied six architectures with varying degrees of modularity to gain insight into the performance implications of designing models for flexible updates over time. Our experiments on the SLURP dataset, modified to simulate the real-world experience of adding new intents over time, show that a single dense model yields 2.5–3.5 points of average improvement over individual domain models, but suffers a median degradation of 0.4–1.1 points as new intents are incorporated. We present a mixture-of-experts-based hybrid system that performs within 2.1 points of the dense model in exact-match accuracy while either improving median performance for untouched domains over time or degrading by only 0.1 points at worst.
Citations: 0
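The modular design the abstract above analyzes, separate per-domain experts behind a gate so new domains can be added without retraining the rest, can be illustrated with a toy sketch. This is not the paper's system: the keyword gate, the two expert functions, and all names here are invented for illustration only.

```python
# Toy mixture of domain experts for intent classification (illustrative only).
# Each "expert" is an arbitrary per-domain classifier; the gate is a trivial
# keyword-overlap router standing in for a learned gating network.

def music_expert(utterance):
    return "play_music" if "play" in utterance else "music_other"

def weather_expert(utterance):
    return "get_forecast" if "weather" in utterance else "weather_other"

class MixtureOfDomainExperts:
    def __init__(self):
        self.experts = {}    # domain name -> intent classifier
        self.keywords = {}   # domain name -> routing keywords

    def add_domain(self, name, expert, keywords):
        # Adding a domain touches no existing expert: this is the
        # modularity-versus-accuracy tradeoff the paper studies.
        self.experts[name] = expert
        self.keywords[name] = set(keywords)

    def route(self, utterance):
        # Toy gating: score each domain by keyword overlap with the utterance.
        tokens = set(utterance.lower().split())
        scores = {d: len(tokens & kws) for d, kws in self.keywords.items()}
        return max(scores, key=scores.get)

    def classify(self, utterance):
        domain = self.route(utterance)
        return domain, self.experts[domain](utterance)

moe = MixtureOfDomainExperts()
moe.add_domain("music", music_expert, ["play", "song", "album"])
moe.add_domain("weather", weather_expert, ["weather", "forecast", "rain"])
print(moe.classify("play my favourite song"))  # ('music', 'play_music')
```

A dense single model would replace both experts and the gate with one network, which the paper finds more accurate on average but harder to update without regressing existing domains.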
Improving Luxembourgish Speech Recognition with Cross-Lingual Speech Representations
2022 IEEE Spoken Language Technology Workshop (SLT) | Pub Date: 2023-01-09 | DOI: 10.1109/SLT54892.2023.10022706
Le-Minh Nguyen, Shekhar Nayak, M. Coler
Abstract: Luxembourgish is a West Germanic language spoken by roughly 390,000 people, mainly in Luxembourg. It is one of Europe's under-described and under-resourced languages and has not been extensively investigated in the context of speech recognition. We explore self-supervised multilingual learning of Luxembourgish speech representations for the downstream speech recognition task. We show that learning cross-lingual representations is essential for low-resourced languages such as Luxembourgish. Learning cross-lingual representations and rescoring the output transcriptions with language modelling, while using only 4 hours of labelled speech, achieves a word error rate of 15.1% and improves on our transfer learning baseline model by 33.1% relatively and 7.5% absolutely. Increasing the amount of labelled speech to 14 hours yields a significant performance gain, resulting in a 9.3% word error rate. Models and datasets are available at https://huggingface.co/lemswasabi
Citations: 1
Investigating the Important Temporal Modulations for Deep-Learning-Based Speech Activity Detection
2022 IEEE Spoken Language Technology Workshop (SLT) | Pub Date: 2023-01-09 | DOI: 10.1109/SLT54892.2023.10022462
Tyler Vuong, Nikhil Madaan, Rohan Panda, R. Stern
Abstract: We describe a learnable modulation spectrogram feature for speech activity detection (SAD). Modulation features capture the temporal dynamics of each frequency subband. We compute learnable modulation spectrogram features by first calculating the log-mel spectrogram and then filtering each frequency subband with a bandpass filter that has a learnable center frequency. The resulting SAD system was evaluated on the Fearless Steps Phase-04 SAD challenge. Experimental results showed that temporal modulations around the 4–6 Hz range are crucial for deep-learning-based SAD. These results align with previous studies that found slow temporal modulations to be most important for speech-processing tasks and speech intelligibility. Additionally, we found that the learnable modulation spectrogram feature outperforms both the standard log-mel and fixed modulation spectrogram features on the Fearless Steps Phase-04 SAD test set.
Citations: 1
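The feature described above, bandpass-filtering each subband's temporal trajectory around a center frequency, can be sketched with a Gabor-style FIR filter. This is our own illustrative approximation: the filter shape, length, and frame rate are assumptions, and the center frequency is passed as a fixed argument here rather than learned as in the paper.

```python
import math

def gabor_bandpass(center_hz, frame_rate, half_len=16):
    # FIR taps: a cosine at center_hz under a Gaussian envelope, so the
    # filter passes temporal modulations near center_hz and attenuates others.
    taps = []
    for n in range(-half_len, half_len + 1):
        t = n / frame_rate
        env = math.exp(-0.5 * (t * 2 * math.pi * center_hz / 4) ** 2)
        taps.append(env * math.cos(2 * math.pi * center_hz * t))
    s = sum(abs(x) for x in taps)
    return [x / s for x in taps]  # normalize overall gain

def filter_subband(subband, taps):
    # 'Same'-length convolution of one subband trajectory with the taps.
    h = len(taps) // 2
    out = []
    for i in range(len(subband)):
        acc = 0.0
        for k, tap in enumerate(taps):
            j = i + k - h
            if 0 <= j < len(subband):
                acc += tap * subband[j]
        out.append(acc)
    return out

# A 5 Hz filter retains far more energy from a 5 Hz modulation than a 20 Hz one.
frame_rate = 100  # spectrogram frames per second (assumed)
taps = gabor_bandpass(5.0, frame_rate)
slow = [math.cos(2 * math.pi * 5.0 * i / frame_rate) for i in range(200)]
fast = [math.cos(2 * math.pi * 20.0 * i / frame_rate) for i in range(200)]
e_slow = sum(x * x for x in filter_subband(slow, taps))
e_fast = sum(x * x for x in filter_subband(fast, taps))
```

In the actual system such a filter would be applied to every log-mel subband, with `center_hz` trained end-to-end; the paper reports these learned centers concentrating around 4–6 Hz.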
Automatic Rating of Spontaneous Speech for Low-Resource Languages
2022 IEEE Spoken Language Technology Workshop (SLT) | Pub Date: 2023-01-09 | DOI: 10.1109/SLT54892.2023.10022381
Ragheb Al-Ghezi, Yaroslav Getman, Ekaterina Voskoboinik, Mittul Singh, M. Kurimo
Abstract: Automatic spontaneous speaking assessment systems bring numerous advantages to second language (L2) learning and assessment, such as promoting self-learning and reducing language teachers' workload. Conventionally, these systems are developed for languages with many learners because training data are abundant, while languages with fewer learners, such as Finnish and Swedish, remain at a disadvantage due to the scarcity of training data. Nevertheless, recent advances in self-supervised deep learning make it possible to develop automatic speech recognition systems with a reasonable amount of training data, which in turn makes it feasible to build systems for automatically assessing the spoken proficiency of learners of under-resourced languages: L2 Finnish and Finland Swedish. Our work evaluates the overall performance of the L2 ASR systems as well as the rating systems against human reference ratings for both languages.
Citations: 3
On Granularity of Prosodic Representations in Expressive Text-to-Speech
2022 IEEE Spoken Language Technology Workshop (SLT) | Pub Date: 2023-01-09 | DOI: 10.1109/SLT54892.2023.10022793
Mikolaj Babianski, Kamil Pokora, Raahil Shah, Rafał Sienkiewicz, Daniel Korzekwa, V. Klimkov
Abstract: In expressive speech synthesis it is common to use latent prosody representations to deal with the variability of the data during training. The same text may correspond to various acoustic realizations, which is known as the one-to-many mapping problem in text-to-speech. Utterance-, word-, or phoneme-level representations are extracted from the target signal in an auto-encoding setup to complement the phonetic input and simplify that mapping. This paper compares prosodic embeddings at different levels of granularity and examines their prediction from text. We show that utterance-level embeddings have insufficient capacity and that phoneme-level embeddings tend to introduce instabilities when predicted from text. Word-level representations strike a balance between capacity and predictability. As a result, we close the gap in naturalness between synthetic speech and recordings on the LibriTTS dataset by 90%, without sacrificing intelligibility.
Citations: 3
Efficient Text Analysis with Pre-Trained Neural Network Models
2022 IEEE Spoken Language Technology Workshop (SLT) | Pub Date: 2023-01-09 | DOI: 10.1109/SLT54892.2023.10022565
Jia Cui, Heng Lu, Wen Wang, Shiyin Kang, Liqiang He, Guangzhi Li, Dong Yu
Abstract: This paper investigates the application of the pre-trained BERT model to three classic text analysis tasks: Chinese grapheme-to-phoneme conversion (G2P), text normalization (TN), and sentence punctuation annotation. Even though the full-sized BERT has prominent modeling power, it faces two challenges in real applications: the requirement for annotated training data and the considerable computational cost. In this paper, we propose BERT-based low-latency solutions. To collect a sufficient training corpus for G2P, we transfer knowledge from an existing rule-based system to BERT through a large amount of unlabeled text. The new model converts all characters directly from raw text with higher accuracy. We also propose a hybrid two-stage text normalization pipeline that reduces the sentence error rate by 25% compared to the rule-based system. We offer both supervised and weakly supervised versions and find that the latter trails the former by only 1% in accuracy.
Citations: 1
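The two-stage text normalization idea above, a fast rule stage for unambiguous patterns with a neural model handling what the rules defer, can be sketched as follows. Everything here is our own illustration: the single-digit rule, the stubbed model stage, and the example sentences are invented, and the real second stage would be a BERT-based tagger rather than a spell-out function.

```python
import re

DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]

def rule_stage(token):
    # Rule stage: handle only patterns the rules can verbalize unambiguously
    # (here, a lone digit). Return None to defer everything else.
    if re.fullmatch(r"\d", token):
        return DIGITS[int(token)]
    return None

def model_stage(token):
    # Stand-in for the neural normalizer; here it just spells digit strings
    # out character by character and passes other tokens through.
    if token.isdigit():
        return " ".join(DIGITS[int(d)] for d in token)
    return token

def normalize(sentence):
    # Two-stage pipeline: rules first, model as fallback.
    out = []
    for token in sentence.split():
        verbalized = rule_stage(token)
        out.append(verbalized if verbalized is not None else model_stage(token))
    return " ".join(out)
```

The benefit of the split is latency and control: most tokens never reach the expensive model, and rule fixes for known-bad cases do not require retraining.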
Phone-Level Pronunciation Scoring for L1 Using Weighted-Dynamic Time Warping
2022 IEEE Spoken Language Technology Workshop (SLT) | Pub Date: 2023-01-09 | DOI: 10.1109/SLT54892.2023.10023182
A. Sini, Antoine Perquin, Damien Lolive, Arnaud Delhay
Abstract: This paper presents a novel approach to phone-level pronunciation scoring. The proposed method relies on the two usual stages of pronunciation scoring: an acoustic model transcribes the spoken utterance into a phoneme sequence, and then Weighted-Dynamic Time Warping (W-DTW) compares the predicted phoneme sequence against the reference one. Our approach alters the comparison process by considering Phonetic PosteriorGrams (PPGs) rather than only the most probable sequence of phonemes. This led us to propose a modified W-DTW algorithm that takes into account the probabilities of the predicted phonemes, as well as the use of articulatory features as a proxy for phonetic similarity. The results achieved are satisfactory given the content of the adult speech database and are comparable to well-known state-of-the-art methods.
Citations: 0
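The core idea above, a DTW whose local cost is weighted by posterior probabilities from the PPG instead of a hard phone match, can be sketched with a small dynamic program. This is a hedged simplification under our own assumptions: the cost is just one minus the posterior of the reference phone, and the articulatory-feature similarity the paper also uses is omitted.

```python
def wdtw_score(reference, posteriors):
    # reference: list of phone symbols, e.g. ["k", "ae", "t"]
    # posteriors: one dict per frame mapping phone -> probability (the PPG)
    # Returns a length-normalized alignment cost: lower = better pronounced.
    n, m = len(reference), len(posteriors)
    INF = float("inf")
    # dp[i][j]: best accumulated cost aligning reference[:i] with frames[:j]
    dp = [[INF] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Local cost: how little posterior mass frame j-1 gives the
            # reference phone i-1.
            cost = 1.0 - posteriors[j - 1].get(reference[i - 1], 0.0)
            dp[i][j] = cost + min(dp[i - 1][j - 1],  # advance both
                                  dp[i][j - 1],      # extra frame on same phone
                                  dp[i - 1][j])      # reference phone skipped
    # Normalize by path length so utterances of different lengths compare.
    return dp[n][m] / (n + m)

# A confidently correct posteriorgram scores lower (better) than a wrong one.
good = wdtw_score(["k", "ae", "t"], [{"k": 0.9}, {"ae": 0.8}, {"t": 0.85}])
bad = wdtw_score(["k", "ae", "t"], [{"g": 0.9}, {"ae": 0.8}, {"d": 0.9}])
```

A hard-decision DTW would collapse each frame's dict to its argmax phone first; keeping the full posterior lets near-misses contribute graded evidence, which is the point of the weighting.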
Interdecoder: Using Attention Decoders as Intermediate Regularization for CTC-Based Speech Recognition
2022 IEEE Spoken Language Technology Workshop (SLT) | Pub Date: 2023-01-09 | DOI: 10.1109/SLT54892.2023.10022760
Tatsuya Komatsu, Yusuke Fujita
Abstract: We propose InterDecoder, a new non-autoregressive automatic speech recognition (NAR-ASR) training method that injects the advantages of token-wise autoregressive decoders while keeping efficient non-autoregressive inference. NAR-ASR models are often less accurate than autoregressive models such as the Transformer decoder, which predicts tokens conditioned on previously predicted tokens. The InterDecoder regularizes training by feeding intermediate encoder outputs into the decoder to compute token-level prediction errors given previous ground-truth tokens, whereas the widely used hybrid CTC/attention model uses the decoder loss only at the final layer. In combination with Self-conditioned CTC, which uses intermediate CTC predictions to condition the encoder, performance is further improved. Experiments on the Librispeech and Tedlium2 datasets show that the proposed method yields up to a 6% relative WER improvement over conventional NAR-ASR methods.
Citations: 0
A Hybrid Acoustic Echo Reduction Approach Using Kalman Filtering and Informed Source Extraction with Improved Training
2022 IEEE Spoken Language Technology Workshop (SLT) | Pub Date: 2023-01-09 | DOI: 10.1109/SLT54892.2023.10023206
Wolfgang Mack, Emanuël Habets
Abstract: State-of-the-art acoustic echo and noise reduction combines adaptive filters with a deep neural network-based postfilter. While the signal-to-distortion ratio is often used for training, it is not well-defined for all echo-reduction scenarios. We propose well-defined loss functions for training, along with modifications of a recently proposed echo reduction system based on informed source extraction. The modifications include using a Kalman filter as a prefilter and a cyclical learning-rate scheduler. The proposed modifications improve performance on the blind test set of the Interspeech 2021 AEC challenge. A comparison with the challenge winner shows that the proposed system underperforms the winner by 0.1 mean opinion score (MOS) points in double-talk echo reduction but outperforms it by 0.3 MOS points in echo-only echo reduction. In all other scenarios, the two algorithms perform comparably.
Citations: 0
Exploration of Language-Specific Self-Attention Parameters for Multilingual End-to-End Speech Recognition
2022 IEEE Spoken Language Technology Workshop (SLT) | Pub Date: 2023-01-09 | DOI: 10.1109/SLT54892.2023.10022937
Brady C. Houston, K. Kirchhoff
Abstract: In the last several years, end-to-end (E2E) ASR models have mostly surpassed the performance of hybrid ASR models. E2E modeling is particularly well suited to multilingual approaches because it does not require language-specific phone alignments for training. Recent work has improved multilingual E2E modeling over naive data pooling on up to several dozen languages by using both language-specific and language-universal model parameters, as well as by informing the network which language is being presented. Complementary to previous work, we analyze language-specific parameters in the attention mechanism of Conformer-based encoder models. We show that using language-specific parameters in the attention mechanism can improve performance across six languages by up to 12% compared to standard multilingual baselines and by up to 36% compared to monolingual baselines, without requiring any additional parameters during monolingual inference or fine-tuning.
Citations: 2