{"title":"Mixture of Domain Experts for Language Understanding: an Analysis of Modularity, Task Performance, and Memory Tradeoffs","authors":"Benjamin Kleiner, Jack G. M. FitzGerald, Haidar Khan, Gokhan Tur","doi":"10.1109/SLT54892.2023.10022866","DOIUrl":"https://doi.org/10.1109/SLT54892.2023.10022866","url":null,"abstract":"One of the limitations of large-scale machine learning models is that they are difficult to adjust after deployment without significant re-training costs. In this paper, we focus on NLU and the needs of virtual assistant systems to continually update themselves through time to support new functionality. Specifically, we consider the tasks of intent classification (IC) and slot filling (SF), which are fundamental to processing user interaction with virtual assistants. We studied six different architectures with varying degrees of modularity in order to gain insights into the performance implications of designing models for flexible updates through time. Our experiments on the SLURP dataset, modified to simulate the real-world experience of adding new intents over time, show that a single dense model yields 2.5–3.5 points of average improvement versus individual domain models, but suffers a median degradation of 0.4–1.1 points as the new intents are incorporated. We present a mixture-of-experts based hybrid system that performs within 2.1 points of the dense model in exact match accuracy while either improving median performance for untouched domains through time or only degrading by 0.1 points at worst.","PeriodicalId":352002,"journal":{"name":"2022 IEEE Spoken Language Technology Workshop (SLT)","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-01-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132859080","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
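The hybrid system described in the record above routes utterances to per-domain experts so that new domains can be added without retraining everything. A minimal sketch of domain-level routing follows; this is illustrative only, not the paper's architecture — the keyword gating, the `experts` table, and the intent labels are all invented stand-ins for learned components.

```python
# Hypothetical sketch of domain-level mixture-of-experts routing for NLU.
# All names (experts, keywords, route) are illustrative, not from the paper.
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

# Toy per-domain "experts": each maps an utterance to an intent label.
experts = {
    "music": lambda utt: "play_music" if "play" in utt else "music_query",
    "weather": lambda utt: "get_weather",
}

# Toy gating scores: keyword overlap stands in for a learned gating network.
keywords = {"music": {"play", "song"}, "weather": {"weather", "rain"}}

def route(utterance):
    domains = list(experts)
    scores = [len(keywords[d] & set(utterance.split())) for d in domains]
    probs = softmax(scores)
    best = domains[max(range(len(domains)), key=probs.__getitem__)]
    return best, experts[best](utterance)

print(route("play a song"))  # ('music', 'play_music')
```

Adding a domain then amounts to registering one more expert and its gating entry while leaving the existing experts untouched — the kind of modular update whose performance cost the paper measures.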
{"title":"Improving Luxembourgish Speech Recognition with Cross-Lingual Speech Representations","authors":"Le-Minh Nguyen, Shekhar Nayak, M. Coler","doi":"10.1109/SLT54892.2023.10022706","DOIUrl":"https://doi.org/10.1109/SLT54892.2023.10022706","url":null,"abstract":"Luxembourgish is a West Germanic language spoken by roughly 390,000 people, mainly in Luxembourg. It is one of Europe's under-described and under-resourced languages, not extensively investigated in the context of speech recognition. We explore the self-supervised multilingual learning of Luxembourgish speech representations for the speech recognition downstream task. We show that learning cross-lingual representations is essential for low-resourced languages such as Luxembourgish. Learning cross-lingual representations and rescoring the output transcriptions with language modelling, while using only 4 hours of labelled speech, achieves a word error rate of 15.1% and improves on our Transfer Learning baseline model by 33.1% relative and 7.5% absolute. Increasing the amount of labelled speech to 14 hours yields a significant performance gain, resulting in a 9.3% word error rate. Models and datasets are available at https://huggingface.co/lemswasabi","PeriodicalId":352002,"journal":{"name":"2022 IEEE Spoken Language Technology Workshop (SLT)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-01-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132322966","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Investigating the Important Temporal Modulations for Deep-Learning-Based Speech Activity Detection","authors":"Tyler Vuong, Nikhil Madaan, Rohan Panda, R. Stern","doi":"10.1109/SLT54892.2023.10022462","DOIUrl":"https://doi.org/10.1109/SLT54892.2023.10022462","url":null,"abstract":"We describe a learnable modulation spectrogram feature for speech activity detection (SAD). Modulation features capture the temporal dynamics of each frequency subband. We compute learnable modulation spectrogram features by first calculating the log-mel spectrogram. Next, we filter each frequency subband with a bandpass filter with a learnable center frequency. The resulting SAD system was evaluated on the Fearless Steps Phase-04 SAD challenge. Experimental results showed that temporal modulations around the 4–6 Hz range are crucial for deep-learning-based SAD. These experimental results align with previous studies that found slow temporal modulation to be most important for speech-processing tasks and speech intelligibility. Additionally, we found that the learnable modulation spectrogram feature outperforms both the standard log-mel and fixed modulation spectrogram features on the Fearless Steps Phase-04 SAD test set.","PeriodicalId":352002,"journal":{"name":"2022 IEEE Spoken Language Technology Workshop (SLT)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-01-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130843529","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
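The feature pipeline described above — a log-mel spectrogram followed by a bandpass filter over each subband's temporal trajectory — can be sketched with a fixed 4–6 Hz band standing in for the paper's learnable center frequencies. The 100 Hz frame rate and the input shapes below are assumptions for illustration, not values from the paper.

```python
# Sketch of modulation filtering with a fixed (rather than learnable) band:
# bandpass each mel subband's trajectory around 4-6 Hz via FFT masking.
import numpy as np

def modulation_filter(log_mel, frame_rate=100.0, lo=4.0, hi=6.0):
    """log_mel: (n_mels, n_frames) array. Returns same-shape filtered features."""
    n_frames = log_mel.shape[1]
    freqs = np.fft.rfftfreq(n_frames, d=1.0 / frame_rate)  # modulation freqs in Hz
    mask = (freqs >= lo) & (freqs <= hi)                   # keep only the 4-6 Hz band
    spec = np.fft.rfft(log_mel, axis=1)                    # per-subband FFT over time
    return np.fft.irfft(spec * mask, n=n_frames, axis=1)

# Toy input: 40 mel bands, 2 seconds of frames at 100 frames/s.
feats = modulation_filter(np.random.randn(40, 200))
print(feats.shape)  # (40, 200)
```

In the paper's learnable variant, `lo` and `hi` (or a center frequency and bandwidth) would be trainable parameters optimized jointly with the SAD network rather than fixed constants.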
{"title":"Automatic Rating of Spontaneous Speech for Low-Resource Languages","authors":"Ragheb Al-Ghezi, Yaroslav Getman, Ekaterina Voskoboinik, Mittul Singh, M. Kurimo","doi":"10.1109/SLT54892.2023.10022381","DOIUrl":"https://doi.org/10.1109/SLT54892.2023.10022381","url":null,"abstract":"Automatic spontaneous speaking assessment systems bring numerous advantages to second language (L2) learning and assessment, such as promoting self-learning and reducing language teachers' workload. Conventionally, these systems are developed for languages with a large number of learners due to the abundance of training data, yet languages with fewer learners such as Finnish and Swedish remain at a disadvantage due to the scarcity of required training data. Nevertheless, recent advancements in self-supervised deep learning make it possible to develop automatic speech recognition systems with a reasonable amount of training data. In turn, this advancement makes it feasible to develop systems for automatically assessing the spoken proficiency of learners of under-resourced languages: L2 Finnish and Finland Swedish. Our work evaluates the overall performance of the L2 ASR systems as well as the rating systems compared to human reference ratings for both languages.","PeriodicalId":352002,"journal":{"name":"2022 IEEE Spoken Language Technology Workshop (SLT)","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-01-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121670559","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"On Granularity of Prosodic Representations in Expressive Text-to-Speech","authors":"Mikolaj Babianski, Kamil Pokora, Raahil Shah, Rafał Sienkiewicz, Daniel Korzekwa, V. Klimkov","doi":"10.1109/SLT54892.2023.10022793","DOIUrl":"https://doi.org/10.1109/SLT54892.2023.10022793","url":null,"abstract":"In expressive speech synthesis, it is common to use latent prosody representations to deal with the variability of the data during training. The same text may correspond to various acoustic realizations, which is known as the one-to-many mapping problem in text-to-speech. Utterance-, word-, or phoneme-level representations are extracted from the target signal in an auto-encoding setup to complement the phonetic input and simplify that mapping. This paper compares prosodic embeddings at different levels of granularity and examines their prediction from text. We show that utterance-level embeddings have insufficient capacity and that phoneme-level embeddings tend to introduce instabilities when predicted from text. Word-level representations strike a balance between capacity and predictability. As a result, we close the gap in naturalness between synthetic speech and recordings on the LibriTTS dataset by 90%, without sacrificing intelligibility.","PeriodicalId":352002,"journal":{"name":"2022 IEEE Spoken Language Technology Workshop (SLT)","volume":"91 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-01-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125634508","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient Text Analysis with Pre-Trained Neural Network Models","authors":"Jia Cui, Heng Lu, Wen Wang, Shiyin Kang, Liqiang He, Guangzhi Li, Dong Yu","doi":"10.1109/SLT54892.2023.10022565","DOIUrl":"https://doi.org/10.1109/SLT54892.2023.10022565","url":null,"abstract":"This paper investigates the application of the pre-trained BERT model in three classic text analysis tasks: Chinese grapheme-to-phoneme (G2P), text normalization (TN), and sentence punctuation annotation. Even though the full-sized BERT has prominent modeling power, there are two challenges for it in real applications: the requirement for annotated training data and the considerable computational cost. In this paper, we propose BERT-based low-latency solutions. To collect a sufficient training corpus for G2P, we transfer knowledge from an existing rule-based system to BERT through a large unlabeled corpus. The new model can convert all characters directly from raw text with higher accuracy. We also propose a hybrid two-stage text normalization pipeline which reduces the sentence error rate by 25% compared to the rule-based system. We offer both supervised and weakly supervised versions and find that the latter has only a 1% accuracy drop relative to the former.","PeriodicalId":352002,"journal":{"name":"2022 IEEE Spoken Language Technology Workshop (SLT)","volume":"110 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-01-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133456470","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Exploration of Language-Specific Self-Attention Parameters for Multilingual End-to-End Speech Recognition","authors":"Brady C. Houston, K. Kirchhoff","doi":"10.1109/SLT54892.2023.10022937","DOIUrl":"https://doi.org/10.1109/SLT54892.2023.10022937","url":null,"abstract":"In the last several years, end-to-end (E2E) ASR models have mostly surpassed the performance of hybrid ASR models. E2E is particularly well suited to multilingual approaches because it does not require language-specific phone alignments for training. Recent work has improved multilingual E2E modeling over naive data pooling on up to several dozen languages by using both language-specific and language-universal model parameters, as well as providing information about the language being presented to the network. Complementary to previous work, we analyze language-specific parameters in the attention mechanism of Conformer-based encoder models. We show that using language-specific parameters in the attention mechanism can improve performance across six languages by up to 12% compared to standard multilingual baselines and up to 36% compared to monolingual baselines, without requiring any additional parameters during monolingual inference or any fine-tuning.","PeriodicalId":352002,"journal":{"name":"2022 IEEE Spoken Language Technology Workshop (SLT)","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-01-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125548551","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Context-Aware Neural Confidence Estimation for Rare Word Speech Recognition","authors":"David Qiu, Tsendsuren Munkhdalai, Yanzhang He, K. Sim","doi":"10.1109/SLT54892.2023.10023411","DOIUrl":"https://doi.org/10.1109/SLT54892.2023.10023411","url":null,"abstract":"Confidence estimation for automatic speech recognition (ASR) is important for many downstream tasks. Recently, neural confidence estimation models (CEMs) have been shown to produce accurate confidence scores for predicting word-level errors. These models are built on top of an end-to-end (E2E) ASR system, and the acoustic embeddings are part of the input features. However, practical E2E ASR systems often incorporate contextual information in the decoder to improve rare word recognition. The CEM is not aware of this and underestimates the confidence of rare words that have been corrected by the context. In this paper, we propose a context-aware CEM by incorporating context into the encoder using a neural associative memory (NAM) model. It uses attention to detect the presence of biasing phrases and modify the encoder features. Experiments show that the proposed context-aware CEM with NAM-augmented training can improve the AUC-ROC for word error prediction from 0.837 to 0.892.","PeriodicalId":352002,"journal":{"name":"2022 IEEE Spoken Language Technology Workshop (SLT)","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-01-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116154827","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
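The biasing mechanism described in the record above can be illustrated with generic single-head attention over phrase embeddings. This is a simplified stand-in for the paper's neural associative memory, and every shape and name below is an assumption for illustration.

```python
# Generic attention sketch: encoder features attend over biasing-phrase
# embeddings and receive a residual context update. Not the paper's NAM.
import numpy as np

def bias_attend(encoder_feats, bias_embs):
    """encoder_feats: (T, d); bias_embs: (N, d). Returns (T, d) biased features."""
    d = encoder_feats.shape[1]
    scores = encoder_feats @ bias_embs.T / np.sqrt(d)        # (T, N) phrase relevance
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)            # softmax over phrases
    context = weights @ bias_embs                            # (T, d) retrieved context
    return encoder_feats + context                           # residual modification

# Toy example: 8 encoder frames of dimension 16, 3 biasing phrases.
out = bias_attend(np.random.randn(8, 16), np.random.randn(3, 16))
print(out.shape)  # (8, 16)
```

Feeding such context-modified features to the CEM is what lets it raise its confidence for rare words that the decoder's biasing has already corrected.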
{"title":"A Hybrid Acoustic Echo Reduction Approach Using Kalman Filtering and Informed Source Extraction with Improved Training","authors":"Wolfgang Mack, Emanuël Habets","doi":"10.1109/SLT54892.2023.10023206","DOIUrl":"https://doi.org/10.1109/SLT54892.2023.10023206","url":null,"abstract":"State-of-the-art acoustic echo and noise reduction combines adaptive filters with a deep neural network-based postfilter. While the signal-to-distortion ratio is often used for training, it is not well-defined for all echo-reduction scenarios. We propose well-defined loss functions for training and modifications of a recently proposed echo reduction system that is based on informed source extraction. The modifications include using a Kalman filter as a prefilter and a cyclical learning rate scheduler. The proposed modifications improve the performance on the blind test set of the Interspeech 2021 AEC challenge. A comparison to the challenge winner shows that the proposed system underperforms the winner by 0.1 mean opinion score (MOS) points in double-talk echo reduction. However, it outperforms the winner by 0.3 MOS points in echo-only echo reduction. In all other scenarios, both algorithms perform comparably.","PeriodicalId":352002,"journal":{"name":"2022 IEEE Spoken Language Technology Workshop (SLT)","volume":"42 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-01-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125138753","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
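One of the training modifications named in the record above, the cyclical learning rate scheduler, can be sketched as a standard triangular schedule; the `base_lr`, `max_lr`, and `step_size` values are illustrative defaults, not the paper's settings.

```python
# Triangular cyclical learning-rate schedule: the LR climbs linearly from
# base_lr to max_lr over step_size steps, then descends back, repeating.
def cyclical_lr(step, base_lr=1e-4, max_lr=1e-3, step_size=1000):
    cycle = step // (2 * step_size)
    x = abs(step / step_size - 2 * cycle - 1)   # goes 1 -> 0 -> 1 within a cycle
    return base_lr + (max_lr - base_lr) * (1 - x)

# LR at the start, peak, and end of the first cycle.
for s in (0, 1000, 2000):
    print(s, cyclical_lr(s))
```

Cycling the learning rate between two bounds is a common alternative to a fixed decay schedule and is the kind of scheduler the paper reports using alongside the Kalman prefilter.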