{"title":"Mixture of Domain Experts for Language Understanding: an Analysis of Modularity, Task Performance, and Memory Tradeoffs","authors":"Benjamin Kleiner, Jack G. M. FitzGerald, Haidar Khan, Gohkan Tur","doi":"10.1109/SLT54892.2023.10022866","DOIUrl":"https://doi.org/10.1109/SLT54892.2023.10022866","url":null,"abstract":"One of the limitations of large-scale machine learning models is that they are difficult to adjust after deployment without significant re-training costs. In this paper, we focus on NLU and the needs of virtual assistant systems to continually update themselves through time to support new functionality. Specifically, we consider the tasks of intent classification (IC) and slot filling (SF), which are fundamental to processing user interaction with virtual assistants. We studied six different architectures with varying degrees of modularity in order to gain insights into the performance implications of designing models for flexible updates through time. Our experiments on the SLURP dataset, modified to simulate the real-world experience of adding new intents over time, show that a single dense model yields 2.5 — 3.5 points of average improvement versus individual domain models, but suffers a median degradation of 0.4 — 1.1 points as the new intents are incorporated. We present a mixture-of-experts based hybrid system that performs within 2.1 points of the dense model in exact match accuracy while either improving median performance for untouched domains through time or only degrading by 0.1 points at worst.","PeriodicalId":352002,"journal":{"name":"2022 IEEE Spoken Language Technology Workshop (SLT)","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-01-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132859080","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improving Luxembourgish Speech Recognition with Cross-Lingual Speech Representations","authors":"Le-Minh Nguyen, Shekhar Nayak, M. Coler","doi":"10.1109/SLT54892.2023.10022706","DOIUrl":"https://doi.org/10.1109/SLT54892.2023.10022706","url":null,"abstract":"Luxembourgish is a West Germanic language spoken by roughly 390,000 people, mainly in Luxembourg. It is one of Europe's under-described and under-resourced languages, not extensively investigated in the context of speech recognition. We explore the self-supervised multilingual learning of Luxembourgish speech representations for the speech recognition downstream task. We show that learning cross-lingual representations is essential for low-resourced languages such as Luxembourgish. Learning cross-lingual representations and rescoring the output transcriptions with language modelling while using only 4 hours of labelled speech achieves a word error rate of 15.1% and improves our Transfer Learning baseline model relatively by 33.1% and absolutely by 7.5%. Increasing the amount of labelled speech to 14 hours yields a significant performance gain resulting in a 9.3% word error rate.11Models and datasets are available at https://hugging£ace.co/lemswasabi","PeriodicalId":352002,"journal":{"name":"2022 IEEE Spoken Language Technology Workshop (SLT)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-01-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132322966","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Investigating the Important Temporal Modulations for Deep-Learning-Based Speech Activity Detection","authors":"Tyler Vuong, Nikhil Madaan, Rohan Panda, R. Stern","doi":"10.1109/SLT54892.2023.10022462","DOIUrl":"https://doi.org/10.1109/SLT54892.2023.10022462","url":null,"abstract":"We describe a learnable modulation spectrogram feature for speech activity detection (SAD). Modulation features capture the temporal dynamics of each frequency subband. We compute learnable modulation spectrogram features by first calculating the log-mel spectrogram. Next, we filter each frequency subband with a bandpass filter that contains a learnable center frequency. The resulting SAD system was evaluated on the Fearless Steps Phase-04 SAD challenge. Experimental results showed that temporal modulations around the 4–6 Hz range are crucial for deep-learning-based SAD. These experimental results align with previous studies that found slow temporal modulation to be most important for speech-processing tasks and speech intelligibility. Additionally, we found that the learnable modulation spectrogram feature outperforms both the standard log-mel and fixed modulation spectrogram features on the Fearless Steps Phase-04 SAD test set.","PeriodicalId":352002,"journal":{"name":"2022 IEEE Spoken Language Technology Workshop (SLT)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-01-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130843529","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Automatic Rating of Spontaneous Speech for Low-Resource Languages","authors":"Ragheb Al-Ghezi, Yaroslav Getman, Ekaterina Voskoboinik, Mittul Singh, M. Kurimo","doi":"10.1109/SLT54892.2023.10022381","DOIUrl":"https://doi.org/10.1109/SLT54892.2023.10022381","url":null,"abstract":"Automatic spontaneous speaking assessment systems bring numerous advantages to second language (L2) learning and assessment such as promoting self-learning and reducing language teachers' workload. Conventionally, these systems are developed for languages with a large number of learners due to the abundance of training data, yet languages with fewer learners such as Finnish and Swedish remain at a disadvantage due to the scarcity of required training data. Nevertheless, recent advancements in self-supervised deep learning make it possible to develop automatic speech recognition systems with a reasonable amount of training data. In turn, this advancement makes it feasible to develop systems for automatically assessing spoken proficiency of learners of underresourced languages: L2 Finnish and Finland Swedish. Our work evaluates the overall performance of the L2 ASR systems as well as the the rating systems compared to human reference ratings for both languages.","PeriodicalId":352002,"journal":{"name":"2022 IEEE Spoken Language Technology Workshop (SLT)","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-01-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121670559","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"On Granularity of Prosodic Representations in Expressive Text-to-Speech","authors":"Mikolaj Babianski, Kamil Pokora, Raahil Shah, Rafał Sienkiewicz, Daniel Korzekwa, V. Klimkov","doi":"10.1109/SLT54892.2023.10022793","DOIUrl":"https://doi.org/10.1109/SLT54892.2023.10022793","url":null,"abstract":"In expressive speech synthesis it is widely adopted to use latent prosody representations to deal with variability of the data during training. Same text may correspond to various acoustic realizations, which is known as a one-to-many mapping problem in text-to-speech. Utterance, word, or phoneme-level representations are extracted from target signal in an auto-encoding setup, to complement phonetic input and simplify that mapping. This paper compares prosodic embeddings at different levels of granularity and examines their prediction from text. We show that utterance-level embeddings have insufficient capacity and phoneme-level tend to introduce instabilities when predicted from text. Word-level representations impose balance between capacity and predictability. As a result, we close the gap in naturalness by 90% between synthetic speech and recordings on LibriTTS dataset, without sacrificing intelligibility.","PeriodicalId":352002,"journal":{"name":"2022 IEEE Spoken Language Technology Workshop (SLT)","volume":"91 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-01-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125634508","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient Text Analysis with Pre-Trained Neural Network Models","authors":"Jia Cui, Heng Lu, Wen Wang, Shiyin Kang, Liqiang He, Guangzhi Li, Dong Yu","doi":"10.1109/SLT54892.2023.10022565","DOIUrl":"https://doi.org/10.1109/SLT54892.2023.10022565","url":null,"abstract":"This paper investigates the application of pre-trained BERT model in three classic text analysis tasks: Chinese grapheme-to-phoneme(G2P), text normalization(TN) and sentence punctuation annotation. Even though the full-sized BERT has prominent modeling power, there are two challenges for it in real applications: the requirement for annotated training data and the considerable computational cost. In this paper, we propose BERT-based low-latency solutions. To collect sufficient training corpus for G2P, we transfer knowledge from existing rule-based system to BERT through a large amount of unlabeled corpus. The new model could convert all characters directly from raw texts with higher accuracy. We also propose a hybrid two-stage text normalization pipeline which reduces the sentence error rate by 25% compared to the rule-based system. We offer both supervised and weakly supervised versions and find that the latter has only 1% accuracy drop from the former.","PeriodicalId":352002,"journal":{"name":"2022 IEEE Spoken Language Technology Workshop (SLT)","volume":"110 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-01-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133456470","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Phone-Level Pronunciation Scoring for L1 Using Weighted-Dynamic Time Warping","authors":"A. Sini, Antoine Perquin, Damien Lolive, Arnaud Delhay","doi":"10.1109/SLT54892.2023.10023182","DOIUrl":"https://doi.org/10.1109/SLT54892.2023.10023182","url":null,"abstract":"This paper presents a novel approach for phone-level pronunciation scoring. The proposed method relies on the two usual stages of pronunciation scoring: an acoustic model transcribes the spoken utterance into a phoneme sequence and then, Weighted-Dynamic Time Warping (W-DTW) is used to compare the predicted phoneme sequence against the reference one. Our approach alters the comparison process by considering Phonetic PosteriorGrams (PPG) rather than only the most probable sequence of phonemes. This led us to propose a modified W-DTW algorithm that considers the probabilities of the predicted phonemes, as well as the use of articulatory features as a proxy of phonetic similarity. The results achieved are satisfactory considering the content of the adult speech database and are comparable to well-known state-of-the-art methods.","PeriodicalId":352002,"journal":{"name":"2022 IEEE Spoken Language Technology Workshop (SLT)","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-01-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114264262","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Interdecoder: using Attention Decoders as Intermediate Regularization for CTC-Based Speech Recognition","authors":"Tatsuya Komatsu, Yusuke Fujita","doi":"10.1109/SLT54892.2023.10022760","DOIUrl":"https://doi.org/10.1109/SLT54892.2023.10022760","url":null,"abstract":"We propose InterDecoder: a new non-autoregressive automatic speech recognition (NAR-ASR) training method that injects the advantage of token-wise autoregressive decoders while keeping the efficient non-autoregressive inference. The NAR-ASR models are often less accurate than autoregressive models such as Transformer decoder, which predict tokens conditioned on previously predicted tokens. The Inter-Decoder regularizes training by feeding intermediate encoder outputs into the decoder to compute the token-level prediction errors given previous ground-truth tokens, whereas the widely used Hybrid CTC/Attention model uses the decoder loss only at the final layer. In combination with Self-conditioned CTC, which uses the Intermediate CTC predictions to condition the encoder, performance is further improved. Experiments on the Librispeech and Tedlium2 dataset show that the proposed method shows a relative 6% WER improvement at the maximum compared to the conventional NAR-ASR methods.","PeriodicalId":352002,"journal":{"name":"2022 IEEE Spoken Language Technology Workshop (SLT)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-01-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124244986","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Hybrid Acoustic Echo Reduction Approach Using Kalman Filtering and Informed Source Extraction with Improved Training","authors":"Wolfgang Mack, Emanuël Habets","doi":"10.1109/SLT54892.2023.10023206","DOIUrl":"https://doi.org/10.1109/SLT54892.2023.10023206","url":null,"abstract":"State-of-the-art acoustic echo and noise reduction combines adaptive filters with a deep neural network-based postfilter. While the signal-to-distortion ratio is often used for training, it is not well-defined for all echo-reduction scenarios. We propose well-defined loss functions for training and modifications of a recently proposed echo reduction system that is based on informed source extraction. The modifications include using a Kalman filter as a prefilter and a cyclical learning rate scheduler. The proposed modifications improve the performance on the blind test set of the Interspeech 2021 AEC challenge. A comparison to the challenge-winner shows that the proposed system underperforms the winner by 0.1 mean opinion score (MOS) points in double-talk echo reduction. However, it outperforms the winner by 0.3 MOS points in echo-only echo reduction. In all other scenarios, both algorithms perform comparably.","PeriodicalId":352002,"journal":{"name":"2022 IEEE Spoken Language Technology Workshop (SLT)","volume":"42 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-01-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125138753","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Exploration of Language-Specific Self-Attention Parameters for Multilingual End-to-End Speech Recognition","authors":"Brady C. Houston, K. Kirchhoff","doi":"10.1109/SLT54892.2023.10022937","DOIUrl":"https://doi.org/10.1109/SLT54892.2023.10022937","url":null,"abstract":"In the last several years, end-to-end (E2E) ASR models have mostly surpassed the performance of hybrid ASR models. E2E is particularly well suited to multilingual approaches because it doesn't require language-specific phone alignments for training. Recent work has improved multilingual E2E modeling over naive data pooling on up to several dozen languages by using both language-specific and language-universal model parameters, as well as providing information about the language being presented to the network. Complementary to previous work we analyze language-specific parameters in the attention mechanism of Conformer-based encoder models. We show that using language-specific parameters in the attention mechanism can improve performance across six languages by up to 12% compared to standard multilingual baselines and up to 36% compared to monolingual baselines, without requiring any additional parameters during monolingual inference nor fine-tuning.","PeriodicalId":352002,"journal":{"name":"2022 IEEE Spoken Language Technology Workshop (SLT)","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-01-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125548551","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}