Joint Semantic Knowledge Distillation and Masked Acoustic Modeling for Full-band Speech Restoration with Improved Intelligibility

arXiv - EE - Signal Processing | Pub Date: 2024-09-14 | DOI: arxiv-2409.09357

Xiaoyu Liu, Xu Li, Joan Serrà, Santiago Pascual



Joint Semantic Knowledge Distillation and Masked Acoustic Modeling for Full-band Speech Restoration with Improved Intelligibility

Speech restoration aims to recover full-band speech with high quality and intelligibility from input affected by a diverse set of distortions. MaskSR is a recently proposed generative model for this task. Like other models of its kind, MaskSR attains high quality but, as we show, its intelligibility can be substantially improved. We do so by boosting the speech encoder component of MaskSR with predictions of semantic representations of the target speech, using a pre-trained self-supervised teacher model. Then, a masked language model is conditioned on the learned semantic features to predict acoustic tokens that encode low-level spectral details of the target speech. We show that, with the same model capacity and inference time as MaskSR, the proposed model, MaskSR2, significantly reduces the word error rate, a typical metric for intelligibility. MaskSR2 also achieves a word error rate competitive with other models, while providing superior quality. An ablation study shows the effectiveness of various semantic representations.
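The abstract uses word error rate (WER) as its intelligibility metric. As a minimal sketch of how that metric is commonly defined, the following computes WER as the word-level edit distance (substitutions, insertions, deletions) divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration; the abstract does not say which speech recognizer or text normalization produced the transcripts being scored.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Hypothetical helper for illustration; tokenization is a plain
    whitespace split, with no case folding or punctuation handling.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far (excluding the current one) and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One deleted word out of six reference words.
    print(round(word_error_rate("the cat sat on the mat",
                                "the cat sat on mat"), 3))  # 0.167
```

In an evaluation like the one described, the hypothesis would be a transcript of the restored speech and the reference the ground-truth text, so a lower value indicates more intelligible output. Note that WER can exceed 1.0 when the hypothesis contains many inserted words.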
