A speech prediction model based on codec modeling and transformer decoding
Authors: Heming Wang, Yufeng Yang, DeLiang Wang
Journal: Computer Speech and Language, vol. 96, Article 101892
Publication date: 2025-09-26
DOI: 10.1016/j.csl.2025.101892
URL: https://www.sciencedirect.com/science/article/pii/S0885230825001172
Impact factor: 3.4 (JCR Q2, Computer Science, Artificial Intelligence)
Citations: 0
Abstract
Speech prediction is essential for tasks like packet loss concealment and algorithmic delay compensation. This paper proposes a novel prediction algorithm that leverages a speech codec and a transformer decoder to autoregressively predict missing frames. Unlike text-guided methods, which require auxiliary information, the proposed approach predicts from speech alone. A comparative study evaluates the proposed and existing speech prediction methods on packet loss concealment (PLC) and frame-wise speech prediction tasks. Comprehensive experiments demonstrate that the proposed model substantially outperforms other state-of-the-art baselines, including on a recent PLC challenge. We also systematically examine factors influencing prediction performance, including context window lengths, prediction lengths, and training and inference strategies.
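The abstract does not give implementation details, so the following is only a minimal sketch of the general idea of autoregressive frame prediction over codec tokens, not the authors' model. It assumes each frame is represented by a single integer codec token, and every name in it (conceal_lost_frames, predict_next, repeat_last, context_len) is hypothetical. A trivial "repeat the last token" function stands in for the transformer decoder.

```python
from typing import Callable, List, Sequence

Token = int  # assumption: one integer codec token per frame


def conceal_lost_frames(
    tokens: Sequence[Token],
    lost: Sequence[bool],
    predict_next: Callable[[Sequence[Token]], Token],
    context_len: int = 8,
) -> List[Token]:
    """Fill lost frames autoregressively, using only preceding tokens."""
    if len(tokens) != len(lost):
        raise ValueError("tokens and lost must have the same length")
    out: List[Token] = []
    for tok, is_lost in zip(tokens, lost):
        if is_lost:
            # The context may include earlier predictions, so each
            # predicted frame conditions the next one (autoregression).
            context = out[-context_len:]
            out.append(predict_next(context))
        else:
            out.append(tok)
    return out


def repeat_last(context: Sequence[Token]) -> Token:
    """Toy stand-in for a learned decoder: repeat the last token (0 if none)."""
    return context[-1] if context else 0


received = [5, 7, 0, 0, 9]  # tokens at lost positions are placeholders
lost_mask = [False, False, True, True, False]
print(conceal_lost_frames(received, lost_mask, repeat_last))  # [5, 7, 7, 7, 9]
```

In this sketch, context_len plays the role of the context window length, and the length of a run of lost frames corresponds to the prediction length, two of the factors the abstract says the paper examines.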
About the journal:
Computer Speech & Language publishes reports of original research related to the recognition, understanding, production, coding and mining of speech and language.
The speech and language sciences have a long history, but it is only relatively recently that large-scale implementation of and experimentation with complex models of speech and language processing has become feasible. Such research is often carried out somewhat separately by practitioners of artificial intelligence, computer science, electronic engineering, information retrieval, linguistics, phonetics, or psychology.