{"title":"Improving Noise Robustness for Spoken Content Retrieval Using Semi-Supervised ASR and N-Best Transcripts for BERT-Based Ranking Models","authors":"Yasufumi Moriya, Gareth J. F. Jones","doi":"10.1109/SLT54892.2023.10023197","DOIUrl":"https://doi.org/10.1109/SLT54892.2023.10023197","url":null,"abstract":"BERT-based re-ranking and dense retrieval (DR) systems have been shown to improve search effectiveness for spoken content retrieval (SCR). However, both methods can still show a reduction in effectiveness when using ASR transcripts in comparison to accurate manual transcripts. We find that a known-item search task on the How2 dataset of spoken instruction videos shows a reduction in mean reciprocal rank (MRR) scores of 10-14%. As a potential method to reduce this disparity, we investigate the use of semi-supervised ASR transcripts and N-best ASR transcripts to mitigate ASR errors for spoken search using BERT-based ranking. Semi-supervised ASR transcripts brought 2-5.5% MRR improvements over standard ASR transcripts and our N-best early fusion methods for BERT DR systems improved MRR by 3-4%. Combining semi-supervised transcripts with N-best early fusion for BERT DR reduced the MRR gap in search effectiveness between manual and ASR transcripts by more than 50% from 14.32% to 6.58%.","PeriodicalId":352002,"journal":{"name":"2022 IEEE Spoken Language Technology Workshop (SLT)","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-01-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133401001","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fine Grained Spoken Document Summarization Through Text Segmentation","authors":"Samantha Kotey, Rozenn Dahyot, N. Harte","doi":"10.1109/SLT54892.2023.10022829","DOIUrl":"https://doi.org/10.1109/SLT54892.2023.10022829","url":null,"abstract":"Podcast transcripts are long spoken documents of conversational dialogue. Challenging to summarize, podcasts cover a diverse range of topics, vary in length, and have uniquely different linguistic styles. Previous studies in podcast summarization have generated short, concise dialogue summaries. In contrast, we propose a method to generate long fine-grained summaries, which describe details of sub-topic narratives. Leveraging a readability formula, we curate a data subset to train a long sequence transformer for abstractive summarization. Through text segmentation, we filter the evaluation data and exclude specific segments of text. We apply the model to segmented data, producing different types of fine grained summaries. We show that appropriate filtering creates comparable results on ROUGE and serves as an alternative method to truncation. Experiments show our model outperforms previous studies on the Spotify podcast dataset when tasked with generating longer sequences of text.","PeriodicalId":352002,"journal":{"name":"2022 IEEE Spoken Language Technology Workshop (SLT)","volume":"132 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-01-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122505702","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Truly Multilingual First Pass and Monolingual Second Pass Streaming on-Device ASR System","authors":"S. Mavandadi, Bo Li, Chaoyang Zhang, B. Farris, Tara N. Sainath, Trevor Strohman","doi":"10.1109/SLT54892.2023.10023346","DOIUrl":"https://doi.org/10.1109/SLT54892.2023.10023346","url":null,"abstract":"Automatic speech recognition (ASR) systems need to be accurate, have low latency, and effectively handle language switching in order to be useful for the 60% of the world population that speaks more than one language. Thus, we propose a truly multilingual first-pass and monolingual second-pass streaming on-device ASR system based on the recently developed Cascaded Encoders model. The streaming first-pass recognizes multilingual speech without needing language information, providing real-time transcription, even for code-switching speech. The second-pass uses a language dependent right context encoder to improve the recognition accuracy. On a 9 language Voice Search task, we find that a system combining shared causal encoder with decoders and non-causal encoders replicated per-language reduces word error rate (WER) by 4.4% relative to monolingual baselines. We further show this design to be parameter efficient, outperforming other architectures when matched in the number of parameters.","PeriodicalId":352002,"journal":{"name":"2022 IEEE Spoken Language Technology Workshop (SLT)","volume":"112 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-01-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117286677","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Exploring a Unified ASR for Multiple South Indian Languages Leveraging Multilingual Acoustic and Language Models","authors":"C. Anoop, A. Ramakrishnan","doi":"10.1109/SLT54892.2023.10022380","DOIUrl":"https://doi.org/10.1109/SLT54892.2023.10022380","url":null,"abstract":"We build a single automatic speech recognition (ASR) model for several south Indian languages using a common set of intermediary labels, which can be easily mapped to the desired native script through simple lookup tables and a few rules. We use Sanskrit Library Phonetic encoding as the labeling scheme, which exploits the similarity in pronunciation across character sets of multiple Indian languages. Unlike the general approaches, which leverage common label sets only for multilingual acoustic modeling, we also explore multilingual language modeling. Our unified model improves the ASR performance in languages with limited amounts of speech data and also in out-of-domain test conditions. Also, the model performs reasonably well in languages with good representation in the training data.","PeriodicalId":352002,"journal":{"name":"2022 IEEE Spoken Language Technology Workshop (SLT)","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-01-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131754967","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Untied Positional Encodings for Efficient Transformer-Based Speech Recognition","authors":"Lahiru Samarakoon, Ivan Fung","doi":"10.1109/SLT54892.2023.10023097","DOIUrl":"https://doi.org/10.1109/SLT54892.2023.10023097","url":null,"abstract":"Self-attention has become a vital component for end-to-end (E2E) automatic speech recognition (ASR). Convolution-augmented Transformer (Conformer) with relative positional encoding (RPE) achieved state-of-the-art performance. This paper proposes a positional encoding (PE) mechanism called Scaled Untied RPE that unties the feature-position correlations in the self-attention computation, and computes feature correlations and positional correlations separately using different projection matrices. In addition, we propose to scale feature correlations with the positional correlations and the aggressiveness of this multiplicative interaction can be configured using a parameter called amplitude. Moreover, we show that the PE matrix can be sliced to reduce model parameters. Our results on National Speech Corpus (NSC) show that Transformer encoders with Scaled Untied RPE achieves relative improvements of 1.9% in accuracy and up to 50.9% in latency over a Conformer baseline respectively.","PeriodicalId":352002,"journal":{"name":"2022 IEEE Spoken Language Technology Workshop (SLT)","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-01-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116941493","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Personalization of CTC Speech Recognition Models","authors":"Saket Dingliwal, Monica Sunkara, S. Ronanki, Jeffrey J. Farris, K. Kirchhoff, S. Bodapati","doi":"10.1109/SLT54892.2023.10022705","DOIUrl":"https://doi.org/10.1109/SLT54892.2023.10022705","url":null,"abstract":"End-to-end speech recognition models trained using joint Connectionist Temporal Classification (CTC)-Attention loss have gained popularity recently. In these models, a non-autoregressive CTC decoder is often used at inference time due to its speed and simplicity. However, such models are hard to personalize because of their conditional independence assumption that prevents output tokens from previous time steps to influence future predictions. To tackle this, we propose a novel two-way approach that first biases the encoder with attention over a predefined list of rare long-tail and out-of-vocabulary (OOV) words and then uses dynamic boosting and phone alignment network during decoding to further bias the subword pre-dictions. We evaluate our approach on open-source VoxPopuli and in-house medical datasets to showcase a 60% improvement in F1 score on domain-specific rare words over a strong CTC baseline.","PeriodicalId":352002,"journal":{"name":"2022 IEEE Spoken Language Technology Workshop (SLT)","volume":"88 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-01-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122413886","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Internal Language Model Personalization of E2E Automatic Speech Recognition Using Random Encoder Features","authors":"Adam Stooke, K. Sim, Mason Chua, Tsendsuren Munkhdalai, Trevor Strohman","doi":"10.1109/SLT54892.2023.10022938","DOIUrl":"https://doi.org/10.1109/SLT54892.2023.10022938","url":null,"abstract":"End-to-end (E2E) speech-to-text models generally require transcribed audio for training and personalization. We introduce the use of random audio encoder features, rather than speech, to fine-tune the final model layers and acquire new vocabulary from text-only data. This technique can be used for on-device personalization before the user has provided any speech data. We show improvements in the recall of new vocabulary and word error rate (WER) on held-out test sets using simulated user experiments on hybrid autoregressive transducer (HAT) models using conformer-based encoders and simple text embeddings for label processing. We compare this approach to the use of synthetic audio, finding random encoder features to be more beneficial with lower computational cost. Experiments show that the maximum benefit is gained by updating specific network components comprising a subset of those expressing the internal language model.","PeriodicalId":352002,"journal":{"name":"2022 IEEE Spoken Language Technology Workshop (SLT)","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-01-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127485123","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Transformer-Based Lip-Reading with Regularized Dropout and Relaxed Attention","authors":"Zhengyang Li, Timo Lohrenz, Matthias Dunkelberg, T. Fingscheidt","doi":"10.1109/SLT54892.2023.10023442","DOIUrl":"https://doi.org/10.1109/SLT54892.2023.10023442","url":null,"abstract":"End-to-end automatic lip-reading usually comprises an encoder-decoder model and an optional external language model. In this work, we introduce two regularization methods to the field of lip-reading: First, we apply the regularized dropout (R-Drop) method to transformer-based lip-reading to improve their training-inference consistency. Second, the relaxed attention technique is applied during training for a better external language model integration. We are the first to show that these two complementary approaches yield particu1arly strong performance if combined in the right manner. In particular, by adding an additional R - Drop loss and smoothing the attention weights in cross multi-head attention during training only, we achieve a new state of the art with a word error rate of 22.2% on Lip Reading Sentences 2 (LRS2). On LRS3, we are 2nd ranked with 25.5% WER using only 1,759 h of training data, while the 1 st rank uses about 90,000 h. Our code is available at GitHub.11https://github.com/ifnspaml/Lipreading-RDrop-RA","PeriodicalId":352002,"journal":{"name":"2022 IEEE Spoken Language Technology Workshop (SLT)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-01-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128091595","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}