Joint Semantic Knowledge Distillation and Masked Acoustic Modeling for Full-band Speech Restoration with Improved Intelligibility

Xiaoyu Liu, Xu Li, Joan Serrà, Santiago Pascual
arXiv - EE - Signal Processing, published 2024-09-14
DOI: arxiv-2409.09357 (https://doi.org/arxiv-2409.09357)
Citations: 0

Abstract

Speech restoration aims at restoring full-band speech with high quality and intelligibility, considering a diverse set of distortions. MaskSR is a recently proposed generative model for this task. Like other models of its kind, MaskSR attains high quality but, as we show, its intelligibility can be substantially improved. We do so by boosting the speech encoder component of MaskSR with predictions of semantic representations of the target speech, using a pre-trained self-supervised teacher model. Then, a masked language model is conditioned on the learned semantic features to predict acoustic tokens that encode low-level spectral details of the target speech. We show that, with the same MaskSR model capacity and inference time, the proposed model, MaskSR2, significantly reduces the word error rate, a typical metric for intelligibility. MaskSR2 also achieves a word error rate competitive with other models, while providing superior quality. An ablation study shows the effectiveness of various semantic representations.
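The abstract reports intelligibility through the word error rate (WER). As a rough illustration of that metric only, the sketch below computes WER in its standard form: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and a hypothesis transcript, divided by the number of reference words. The function name `word_error_rate` and the plain whitespace tokenization are assumptions made for this example. The evaluation pipeline behind the reported numbers (the speech recognizer used to transcribe restored speech, any text normalization) is not described on this page and is not reproduced here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER = word-level edit distance / number of reference words.

    Hypothetical helper for illustration; tokenization is a plain
    whitespace split, with no case folding or punctuation handling.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution (sat -> sit) and one deletion (the) over six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference. In practice both transcripts are usually normalized (case, punctuation) before scoring, which this sketch leaves to the caller.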