A Comprehensive Study on Self-Supervised Distillation for Speaker Representation Learning

Zhengyang Chen, Yao Qian, Bing Han, Y. Qian, Michael Zeng

2022 IEEE Spoken Language Technology Workshop (SLT), 28 October 2022
DOI: 10.1109/SLT54892.2023.10022470
Abstract
In real application scenarios, it is often challenging to obtain a large amount of labeled data for speaker representation learning due to speaker privacy concerns. Self-supervised learning, which requires no labels, has therefore become an increasingly promising way to address this problem. Compared with contrastive learning, self-distillation approaches use only positive samples in the loss function and are thus more attractive. In this paper, we present a comprehensive study on self-distilled, self-supervised speaker representation learning, with a particular focus on critical data augmentation. Our proposed audio perturbation augmentation strategy pushes the performance of the speaker representation to a new limit. The experimental results show that our model achieves a new state of the art on the VoxCeleb1 speaker verification benchmark (equal error rates (EER) of 2.505%, 2.473%, and 4.791% on the Vox1-O, Vox1-E, and Vox1-H trials, respectively) without using any speaker labels during training.
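To make the self-distillation idea concrete, below is a minimal sketch (not the authors' implementation) of a DINO-style teacher-student objective computed over two speed-perturbed "positive" views of the same utterances. The encoder, projection-head sizes, temperatures, and the `speed_perturb` helper are all illustrative assumptions; in practice the teacher would be an exponential moving average (EMA) of the student and the encoder a real speaker network such as ECAPA-TDNN. The key contrast with contrastive learning is visible in the loss: it compares only positive views and contains no negative-sample term.

```python
# Minimal sketch of DINO-style self-distillation with speed-perturbation
# augmentation. All module names, dimensions, and temperatures here are
# illustrative assumptions, not the paper's actual configuration.
import torch
import torch.nn.functional as F


def speed_perturb(wave: torch.Tensor, factor: float) -> torch.Tensor:
    """Naive speed perturbation by linear resampling.

    wave: (batch, samples); factor > 1 shortens (speeds up) the audio.
    Real pipelines typically use sox-based resampling instead.
    """
    new_len = int(wave.shape[-1] / factor)
    return F.interpolate(wave.unsqueeze(1), size=new_len,
                         mode="linear", align_corners=False).squeeze(1)


class ToyEncoder(torch.nn.Module):
    """Stand-in for a real speaker encoder (e.g., ECAPA-TDNN).

    Pools the waveform to a fixed length so views of different
    durations map to the same embedding dimension.
    """

    def __init__(self, emb_dim: int = 256, pooled: int = 128):
        super().__init__()
        self.pooled = pooled
        self.proj = torch.nn.Linear(pooled, emb_dim)

    def forward(self, wave: torch.Tensor) -> torch.Tensor:  # (B, samples)
        x = F.adaptive_avg_pool1d(wave.unsqueeze(1), self.pooled).squeeze(1)
        return self.proj(x)


class DistillHead(torch.nn.Module):
    """Projection head mapping an embedding to distribution logits."""

    def __init__(self, emb_dim: int = 256, out_dim: int = 4096):
        super().__init__()
        self.mlp = torch.nn.Sequential(
            torch.nn.Linear(emb_dim, emb_dim), torch.nn.GELU(),
            torch.nn.Linear(emb_dim, out_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.mlp(x)


def self_distillation_loss(student_logits, teacher_logits, center,
                           tau_s: float = 0.1, tau_t: float = 0.04):
    """Cross-entropy between teacher and student output distributions.

    Only positive views of the same utterance are compared; there is no
    negative-sample term, unlike contrastive objectives such as InfoNCE.
    Centering plus a sharper teacher temperature guards against collapse.
    """
    t = F.softmax((teacher_logits - center) / tau_t, dim=-1).detach()
    log_s = F.log_softmax(student_logits / tau_s, dim=-1)
    return -(t * log_s).sum(dim=-1).mean()


if __name__ == "__main__":
    torch.manual_seed(0)
    student = torch.nn.Sequential(ToyEncoder(), DistillHead())
    teacher = torch.nn.Sequential(ToyEncoder(), DistillHead())
    teacher.load_state_dict(student.state_dict())  # in practice: EMA of student
    for p in teacher.parameters():
        p.requires_grad_(False)

    wave = torch.randn(4, 16000)          # 4 one-second clips at 16 kHz
    view_s = speed_perturb(wave, 1.1)     # student view
    view_t = speed_perturb(wave, 0.9)     # teacher view (positive pair)
    center = torch.zeros(4096)            # running center, updated during training
    loss = self_distillation_loss(student(view_s), teacher(view_t), center)
    print(f"self-distillation loss: {loss.item():.4f}")
```

Note that because the teacher receives no gradient and only positive pairs are used, the centering term and the asymmetric temperatures (tau_t < tau_s) are what keep the representations from collapsing to a constant; this design choice follows the general DINO recipe rather than any detail specific to this paper.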