A Comprehensive Study on Self-Supervised Distillation for Speaker Representation Learning

Zhengyang Chen, Yao Qian, Bing Han, Y. Qian, Michael Zeng
Published in: 2022 IEEE Spoken Language Technology Workshop (SLT)
Publication date: 2022-10-28
DOI: https://doi.org/10.1109/SLT54892.2023.10022470
Citations: 4

Abstract

In real-world applications, it is often difficult to obtain large amounts of labeled data for speaker representation learning because of speaker privacy concerns. Self-supervised learning, which requires no labels, has become an increasingly promising way to address this problem. Compared with contrastive learning, self-distilled approaches use only positive samples in the loss function, which makes them more attractive. In this paper, we present a comprehensive study of self-distilled self-supervised speaker representation learning, with a particular focus on the critical role of data augmentation. Our proposed audio perturbation augmentation strategy pushes speaker representation performance to a new limit. Experimental results show that our model achieves a new state of the art (SoTA) on the VoxCeleb1 speaker verification benchmark, with equal error rates (EER) of 2.505%, 2.473%, and 4.791% on the Vox1-O, Vox1-E, and Vox1-H trials, respectively, without using any speaker labels during training.
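For readers unfamiliar with the metric quoted above, the equal error rate is the operating point at which the false acceptance rate (impostor trials accepted) equals the false rejection rate (target trials rejected). The following is a minimal sketch of how it could be computed from trial scores. It is not the authors' evaluation code: the function name `equal_error_rate`, the threshold sweep over observed scores, and the toy scores are all assumptions made for illustration.

```python
import numpy as np


def equal_error_rate(scores, labels):
    """Approximate the EER by sweeping thresholds over the observed scores.

    scores: similarity score per trial (higher means "same speaker").
    labels: 1 for target (same-speaker) trials, 0 for impostor trials.
    Hypothetical helper for illustration; real toolkits often interpolate
    the ROC curve instead of picking the closest observed threshold.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)

    best_gap, eer = np.inf, 1.0
    for threshold in np.unique(scores):
        accept = scores >= threshold
        far = np.mean(accept[~labels])   # impostor trials wrongly accepted
        frr = np.mean(~accept[labels])   # target trials wrongly rejected
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the threshold where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2.0
    return eer


# Toy trial list (made-up scores, not data from the paper).
toy_scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.4, 0.2, 0.1]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]
print(f"EER = {100 * equal_error_rate(toy_scores, toy_labels):.1f}%")
```

On this toy list, one of four target trials scores below one of four impostor trials, so the two error rates meet at 25%. The figures reported in the abstract would come from the full Vox1-O, Vox1-E, and Vox1-H trial lists rather than a handful of scores.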