Title: Joint Semantic Knowledge Distillation and Masked Acoustic Modeling for Full-band Speech Restoration with Improved Intelligibility
Authors: Xiaoyu Liu, Xu Li, Joan Serrà, Santiago Pascual
arXiv: 2409.09357 (arXiv - EE - Signal Processing)
Published: 2024-09-14
Abstract
Speech restoration aims at restoring full-band speech with high quality and intelligibility under a diverse set of distortions. MaskSR is a recently proposed generative model for this task. Like other models of its kind, MaskSR attains high quality, but, as we show, its intelligibility can be substantially improved. We do so by boosting the speech encoder component of MaskSR with predictions of semantic representations of the target speech, using a pre-trained self-supervised teacher model. A masked language model is then conditioned on the learned semantic features to predict acoustic tokens that encode low-level spectral details of the target speech. We show that, with the same model capacity and inference time as MaskSR, the proposed model, MaskSR2, significantly reduces the word error rate, a typical metric for intelligibility. MaskSR2 also achieves a word error rate competitive with other models while providing superior quality. An ablation study demonstrates the effectiveness of various semantic representations.
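The abstract describes a two-part training signal: a distillation term that pushes the speech encoder toward a frozen self-supervised teacher's semantic features, plus a masked-token prediction term over acoustic tokens. The sketch below illustrates that combination in NumPy; the specific loss forms (L1 feature matching, cross-entropy on masked positions) and all shapes are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(0)

def distillation_loss(student_feats, teacher_feats):
    # Match student encoder features to frozen teacher features.
    # L1 distance is an assumed choice of feature-matching loss.
    return np.mean(np.abs(student_feats - teacher_feats))

def masked_token_loss(logits, targets, mask):
    # Cross-entropy over the acoustic-token codebook, computed
    # only at masked positions (masked-LM style; exact masking
    # schedule and loss weighting are assumptions).
    logp = logits - np.log(np.sum(np.exp(logits), axis=-1, keepdims=True))
    nll = -logp[np.arange(len(targets)), targets]
    return np.sum(nll * mask) / np.maximum(mask.sum(), 1)

T, D, V = 50, 16, 1024              # frames, feature dim, codebook size (made up)
student = rng.normal(size=(T, D))   # speech encoder outputs
teacher = rng.normal(size=(T, D))   # self-supervised teacher features
logits = rng.normal(size=(T, V))    # masked-LM predictions per frame
targets = rng.integers(0, V, size=T)  # ground-truth acoustic tokens
mask = rng.random(T) < 0.5          # which frames were masked

loss = distillation_loss(student, teacher) + masked_token_loss(logits, targets, mask)
print(float(loss))
```

In this joint setup the semantic term shapes what the encoder represents, while the masked-token term trains the generator that reconstructs spectral detail, which is consistent with the abstract's claim that intelligibility improves without extra capacity or inference cost.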