A Comprehensive Study on Self-Supervised Distillation for Speaker Representation Learning

2022 IEEE Spoken Language Technology Workshop (SLT) | Pub Date: 2022-10-28 | DOI: 10.1109/SLT54892.2023.10022470

Zhengyang Chen, Yao Qian, Bing Han, Y. Qian, Michael Zeng


Citations: 4

Abstract


A Comprehensive Study on Self-Supervised Distillation for Speaker Representation Learning

In real-world applications, it is often difficult to obtain large amounts of labeled data for speaker representation learning because of speaker privacy concerns. Self-supervised learning, which requires no labels, has become an increasingly promising way to address this problem. Compared with contrastive learning, self-distilled approaches use only positive samples in the loss function, which makes them more attractive. In this paper, we present a comprehensive study of self-distilled self-supervised speaker representation learning, with particular attention to data augmentation, which is critical. Our proposed audio perturbation augmentation strategy pushes speaker representation performance to a new limit. Experimental results show that our model achieves a new state of the art (SoTA) on the VoxCeleb1 speaker verification benchmark, with equal error rates (EER) of 2.505%, 2.473%, and 4.791% on the Vox1-O, Vox1-E, and Vox1-H trials, respectively, without using any speaker labels during training.
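
The abstract reports its results as equal error rate (EER), the operating point at which the false acceptance rate equals the false rejection rate. The following is a minimal sketch of that standard definition, not the authors' scoring code. The function name `equal_error_rate` is hypothetical and the toy scores and labels are invented purely for illustration.

```python
def equal_error_rate(scores, labels):
    """Approximate the EER of a set of verification trials.

    scores: similarity scores, higher means "more likely the same speaker".
    labels: 1 for a target (same-speaker) trial, 0 for a non-target trial.
    Returns (eer, threshold), where a trial is accepted if score >= threshold.
    """
    n_target = sum(labels)
    n_nontarget = len(labels) - n_target
    if n_target == 0 or n_nontarget == 0:
        raise ValueError("need at least one target and one non-target trial")

    # Sweep the threshold from the highest score downwards.
    trials = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)

    # Before any trial is accepted: FAR = 0, FRR = 1.
    best_gap, eer, threshold = 1.0, 0.5, float("inf")
    accepted_targets = 0
    accepted_nontargets = 0
    i = 0
    while i < len(trials):
        score = trials[i][0]
        # Accept every trial tied at this score in one step.
        while i < len(trials) and trials[i][0] == score:
            if trials[i][1] == 1:
                accepted_targets += 1
            else:
                accepted_nontargets += 1
            i += 1
        far = accepted_nontargets / n_nontarget        # false acceptance rate
        frr = (n_target - accepted_targets) / n_target  # false rejection rate
        gap = abs(far - frr)
        if gap < best_gap:
            # With finitely many trials FAR and FRR rarely meet exactly,
            # so report their mean at the closest crossing point.
            best_gap, eer, threshold = gap, (far + frr) / 2, score
    return eer, threshold


# Toy trials (made up for illustration only).
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
toy_labels = [1, 1, 0, 1, 0, 0]
toy_eer, toy_threshold = equal_error_rate(toy_scores, toy_labels)
print(f"EER = {toy_eer:.3%} at threshold {toy_threshold}")
```

Evaluation toolkits usually interpolate between adjacent thresholds on the ROC or DET curve instead of taking the closest point, so values computed this way can differ slightly from published figures on small trial lists.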

