Test-Time Adaptation Toward Personalized Speech Enhancement: Zero-Shot Learning with Knowledge Distillation

2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA) Pub Date : 2021-05-08 DOI:10.1109/WASPAA52581.2021.9632771

Sunwoo Kim, Minje Kim

{"title":"Test-Time Adaptation Toward Personalized Speech Enhancement: Zero-Shot Learning with Knowledge Distillation","authors":"Sunwoo Kim, Minje Kim","doi":"10.1109/WASPAA52581.2021.9632771","DOIUrl":null,"url":null,"abstract":"In realistic speech enhancement settings for end-user devices, we often encounter only a few speakers and noise types that tend to reoccur in the specific acoustic environment. We propose a novel personalized speech enhancement method to adapt a compact denoising model to the test-time specificity. Our goal in this test-time adaptation is to utilize no clean speech target of the test speaker, thus fulfilling the requirement for zero-shot learning. To complement the lack of clean speech, we employ the knowledge distillation framework: we distill the more advanced denoising results from an overly large teacher model, and use them as the pseudo target to train the small student model. This zero-shot learning procedure circumvents the process of collecting users' clean speech, a process that users are reluctant to comply due to privacy concerns and technical difficulty of recording clean voice. Experiments on various test-time conditions show that the proposed personalization method can significantly improve the compact models' performance during the test time. Furthermore, since the personalized models outperform larger non-personalized baseline models, we claim that personalization achieves model compression with no loss of denoising performance. As expected, the student models underperform the state-of-the-art teacher models.","PeriodicalId":429900,"journal":{"name":"2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-05-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"10","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/WASPAA52581.2021.9632771","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 10

Abstract

In realistic speech enhancement settings for end-user devices, we often encounter only a few speakers and noise types that tend to reoccur in the specific acoustic environment. We propose a novel personalized speech enhancement method to adapt a compact denoising model to the test-time specificity. Our goal in this test-time adaptation is to utilize no clean speech target of the test speaker, thus fulfilling the requirement for zero-shot learning. To complement the lack of clean speech, we employ the knowledge distillation framework: we distill the more advanced denoising results from an overly large teacher model, and use them as the pseudo target to train the small student model. This zero-shot learning procedure circumvents the process of collecting users' clean speech, a process that users are reluctant to comply due to privacy concerns and technical difficulty of recording clean voice. Experiments on various test-time conditions show that the proposed personalization method can significantly improve the compact models' performance during the test time. Furthermore, since the personalized models outperform larger non-personalized baseline models, we claim that personalization achieves model compression with no loss of denoising performance. As expected, the student models underperform the state-of-the-art teacher models.

查看原文本刊更多论文

面向个性化语音增强的测试时间适应:基于知识蒸馏的零学习

在终端用户设备的实际语音增强设置中，我们经常只遇到少数扬声器和噪声类型，这些类型往往会在特定的声学环境中再次出现。我们提出了一种新的个性化语音增强方法，使一个紧凑的去噪模型适应测试时间的特殊性。在这个测试时间适应中，我们的目标是利用测试说话者没有干净的语音目标，从而满足零射击学习的要求。为了弥补缺乏干净的语音，我们采用了知识蒸馏框架:我们从一个过大的教师模型中提取更高级的去噪结果，并将它们作为伪目标来训练小的学生模型。这种零射击学习过程规避了收集用户干净语音的过程，因为用户出于隐私考虑和录制干净语音的技术难度而不愿意遵守这一过程。在各种测试时间条件下的实验表明，所提出的个性化方法可以显著提高紧凑模型在测试时间内的性能。此外，由于个性化模型优于较大的非个性化基线模型，我们声称个性化在不损失去噪性能的情况下实现了模型压缩。不出所料，学生模型不如最先进的教师模型。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)

自引率

0.00%

发文量