{"title":"Continual End-to-End Speech-to-Text translation using augmented bi-sampler","authors":"Balaram Sarkar, Pranav Karande, Ankit Malviya, Chandresh Kumar Maurya","doi":"10.1016/j.csl.2025.101885","DOIUrl":null,"url":null,"abstract":"<div><div>Speech-to-Text (ST) is the translation of speech in one language to text in another language. Earlier models for ST used a pipeline approach combining automatic speech recognition (ASR) and machine translation (MT). Such models suffer from cascade error propagation, high latency and memory consumption. Therefore, End-to-End (E2E) ST models were proposed. Adapting E2E ST models to new language pairs results in deterioration of performance on the previously trained language pairs. This phenomenon is called Catastrophic Forgetting (CF). Therefore, we need ST models that can learn continually. The present work proposes a novel continual learning (CL) framework for E2E ST tasks. The core idea behind our approach combines proportional-language sampling (PLS), random sampling (RS), and augmentation. RS helps in performing well on the current task by sampling aggressively from it. PLS is used to sample equal proportion from past task data but it may cause over-fitting. To mitigate that, a combined approach of PLS+RS is used, dubbed as continual bi-sampler (CBS). However, CBS still suffers from over-fitting due to repeated samples from the past tasks. Therefore, we apply various augmentation strategies combined with CBS which we call continual augmented bi-sampler (CABS). We perform experiments on 4 language pairs of MuST-C (One to Many) and mTEDx (Many to Many) datasets and achieve a gain of <strong>68.38%</strong> and <strong>41%</strong> respectively in the average BLEU score compared to baselines. CABS also mitigates the average forgetting by <strong>82.2%</strong> in MuST-C dataset compared to the Gradient Episodic Memory (GEM) baseline. The results show that the proposed CL based E2E ST ensures knowledge retention across previously trained languages. To the best of our knowledge, E2E ST model has not been studied before in a CL setup.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"96 ","pages":"Article 101885"},"PeriodicalIF":3.4000,"publicationDate":"2025-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computer Speech and Language","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S088523082500110X","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0
Abstract
Speech-to-Text (ST) translation is the translation of speech in one language into text in another language. Earlier ST models used a pipeline approach combining automatic speech recognition (ASR) and machine translation (MT). Such models suffer from cascaded error propagation, high latency, and high memory consumption. Therefore, End-to-End (E2E) ST models were proposed. Adapting E2E ST models to new language pairs degrades performance on the previously trained language pairs, a phenomenon called Catastrophic Forgetting (CF). We therefore need ST models that can learn continually. The present work proposes a novel continual learning (CL) framework for E2E ST tasks. The core idea behind our approach combines proportional-language sampling (PLS), random sampling (RS), and augmentation. RS helps the model perform well on the current task by sampling aggressively from it. PLS samples equal proportions from past-task data, but it may cause over-fitting. To mitigate this, a combined PLS+RS approach is used, dubbed the continual bi-sampler (CBS). However, CBS still suffers from over-fitting due to repeated samples from past tasks. Therefore, we apply various augmentation strategies combined with CBS, which we call the continual augmented bi-sampler (CABS). We perform experiments on 4 language pairs from the MuST-C (one-to-many) and mTEDx (many-to-many) datasets and achieve gains of 68.38% and 41%, respectively, in average BLEU score over the baselines. CABS also reduces average forgetting by 82.2% on the MuST-C dataset compared to the Gradient Episodic Memory (GEM) baseline. The results show that the proposed CL-based E2E ST model ensures knowledge retention across previously trained languages. To the best of our knowledge, E2E ST models have not been studied before in a CL setup.
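To make the sampling idea concrete, below is a minimal Python sketch of how a continual bi-sampler could assemble a training batch: an equal share from every past language pair (PLS), an aggressive random draw from the current task (RS), and an optional augmentation hook that turns CBS into CABS. The function name, arguments, sampling ratios, and augmentation callable are illustrative assumptions based only on the abstract, not the authors' actual implementation or exact sampling rule.

```python
import random

def continual_bi_sampler(past_tasks, current_task, replay_size, current_size, augment=None):
    """Illustrative sketch of the CBS/CABS idea described in the abstract.

    past_tasks   : dict mapping a past language pair -> list of (speech, text) examples
    current_task : list of (speech, text) examples for the language pair being learned
    replay_size  : total number of past examples to replay, split equally across pairs (PLS)
    current_size : number of examples drawn at random from the current task (RS)
    augment      : optional callable applied to replayed examples (CABS); skipped if None
    """
    batch = []

    # Proportional-language sampling (PLS): an equal share from every past language pair.
    per_pair = replay_size // max(len(past_tasks), 1)
    for pair, examples in past_tasks.items():
        # Sampling with replacement mirrors the repetition of past samples that
        # motivates augmentation in CABS.
        replayed = random.choices(examples, k=per_pair)
        if augment is not None:
            replayed = [augment(x) for x in replayed]
        batch.extend(replayed)

    # Random sampling (RS): sample aggressively from the current task so it is learned well.
    batch.extend(random.sample(current_task, k=min(current_size, len(current_task))))

    random.shuffle(batch)
    return batch
```

In this sketch, passing augment=None corresponds to plain CBS, while supplying a speech/text augmentation function (e.g. perturbing the audio features of each replayed example) corresponds to CABS; the specific augmentation strategies used in the paper are not specified in the abstract.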
Journal introduction
Computer Speech & Language publishes reports of original research related to the recognition, understanding, production, coding and mining of speech and language.
The speech and language sciences have a long history, but it is only relatively recently that large-scale implementation of and experimentation with complex models of speech and language processing has become feasible. Such research is often carried out somewhat separately by practitioners of artificial intelligence, computer science, electronic engineering, information retrieval, linguistics, phonetics, or psychology.