{"title":"Initial decoding with minimally augmented language model for improved lattice rescoring in low resource ASR","authors":"Savitha Murthy, Dinkar Sitaram","doi":"10.1007/s12046-024-02520-0","DOIUrl":null,"url":null,"abstract":"<p>Automatic speech recognition systems for low-resource languages typically have smaller corpora on which the language model is trained. Decoding with such a language model leads to a high word error rate due to the large number of out-of-vocabulary words in the test data. Larger language models can be used to rescore the lattices generated from initial decoding. This approach, however, gives only a marginal improvement. Decoding with a larger augmented language model, though helpful, is memory intensive and not feasible for low resource system setup. The objective of our research is to perform initial decoding with a minimally augmented language model. The lattices thus generated are then rescored with a larger language model. We thus obtain a significant reduction in error for low-resource Indic languages, namely, Kannada and Telugu. This paper addresses the problem of improving speech recognition accuracy with lattice rescoring in low-resource languages where the baseline language model is not sufficient for generating inclusive lattices. We minimally augment the baseline language model with unigram counts of words that are present in a larger text corpus of the target language but absent in the baseline. The lattices generated after decoding with a minimally augmented baseline language model are more comprehensive for rescoring. We obtain 21.8% (for Telugu) and 41.8% (for Kannada) relative word error reduction with our proposed method. This reduction in word error rate is comparable to 21.5% (for Telugu) and 45.9% (for Kannada) relative word error reduction obtained by decoding with full Wikipedia text augmented language mode while our approach consumes only 1/8th the memory. We demonstrate that our method is comparable with various text selection-based language model augmentation and also consistent for data sets of different sizes. Our approach is applicable for training speech recognition systems under low resource conditions where speech data and compute resources are insufficient, while there is a large text corpus that is available in the target language. Our research involves addressing the issue of out-of-vocabulary words of the baseline in general and does not focus on resolving the absence of named entities. Our proposed method is simple and yet computationally less expensive.</p>","PeriodicalId":21498,"journal":{"name":"Sādhanā","volume":"58 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-05-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Sādhanā","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1007/s12046-024-02520-0","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Automatic speech recognition systems for low-resource languages typically have smaller corpora on which the language model is trained. Decoding with such a language model leads to a high word error rate due to the large number of out-of-vocabulary words in the test data. Larger language models can be used to rescore the lattices generated from initial decoding; this approach, however, yields only a marginal improvement. Decoding with a larger augmented language model, though helpful, is memory intensive and not feasible for a low-resource system setup. The objective of our research is to perform initial decoding with a minimally augmented language model. The lattices thus generated are then rescored with a larger language model. We thereby obtain a significant reduction in error for low-resource Indic languages, namely Kannada and Telugu. This paper addresses the problem of improving speech recognition accuracy through lattice rescoring in low-resource languages where the baseline language model is not sufficient for generating inclusive lattices. We minimally augment the baseline language model with unigram counts of words that are present in a larger text corpus of the target language but absent in the baseline. The lattices generated after decoding with this minimally augmented baseline language model are more comprehensive for rescoring. We obtain 21.8% (for Telugu) and 41.8% (for Kannada) relative word error reduction with our proposed method. This reduction in word error rate is comparable to the 21.5% (for Telugu) and 45.9% (for Kannada) relative word error reduction obtained by decoding with a language model augmented with the full Wikipedia text, while our approach consumes only one-eighth of the memory. We demonstrate that our method is comparable with various text-selection-based language model augmentation methods and is also consistent across data sets of different sizes. Our approach is applicable for training speech recognition systems under low-resource conditions where speech data and compute resources are insufficient but a large text corpus is available in the target language. Our research addresses the issue of out-of-vocabulary words of the baseline in general and does not focus on resolving the absence of named entities. Our proposed method is simple yet computationally inexpensive.
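To make the minimal augmentation step concrete, the sketch below gathers unigram counts for words that occur in a larger target-language text corpus but are missing from the baseline language model vocabulary. The file names, whitespace tokenization, and output format are illustrative assumptions rather than the authors' exact pipeline; the resulting counts would still need to be merged into the baseline n-gram model with an LM toolkit before the initial decoding pass.

```python
# Minimal sketch of the unigram-augmentation idea described in the abstract.
# Assumptions (not from the paper): plain-text inputs, whitespace tokenization,
# and a simple tab-separated word/count output consumed by a separate LM toolkit.
from collections import Counter


def collect_oov_unigram_counts(baseline_vocab_path, large_corpus_path, out_path):
    """Count words present in the large corpus but absent from the baseline vocabulary."""
    # Baseline vocabulary: one word per line (first field only).
    with open(baseline_vocab_path, encoding="utf-8") as f:
        baseline_vocab = {line.split()[0] for line in f if line.strip()}

    # Scan the larger target-language corpus and count out-of-vocabulary words.
    oov_counts = Counter()
    with open(large_corpus_path, encoding="utf-8") as f:
        for line in f:
            for word in line.split():
                if word not in baseline_vocab:
                    oov_counts[word] += 1

    # Write counts; these would be added to the baseline LM as unigrams
    # (e.g., via an n-gram toolkit) to produce the minimally augmented model.
    with open(out_path, "w", encoding="utf-8") as f:
        for word, count in oov_counts.most_common():
            f.write(f"{word}\t{count}\n")


if __name__ == "__main__":
    # Hypothetical file names used purely for illustration.
    collect_oov_unigram_counts("baseline_vocab.txt", "wikipedia_text.txt", "oov_unigram_counts.txt")
```

Because only unigram entries for the missing words are added, the augmented model stays close to the baseline in size, which is what keeps the initial decoding pass far cheaper in memory than decoding with a fully augmented language model.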