Learning Discriminative Speaker Embedding by Improving Aggregation Strategy and Loss Function for Speaker Verification
Chengfang Luo, Xin Guo, Aiwen Deng, Wei Xu, Junhong Zhao, Wenxiong Kang
2021 IEEE International Joint Conference on Biometrics (IJCB), published 2021-08-04
DOI: 10.1109/IJCB52358.2021.9484331
Citations: 5
Abstract
Embedding-based speaker verification (SV) technology has witnessed significant progress due to advances in deep convolutional neural networks (DCNNs). However, improving the discrimination of speaker embeddings in the open-world SV task remains a focus of current research in the community. In this paper, we improve the discriminative power of speaker embeddings in three ways: (1) NeXtVLAD is introduced to aggregate frame-level features; it decomposes the high-dimensional frame-level features into a group of low-dimensional vectors before applying VLAD aggregation. (2) A multi-scale aggregation strategy (MSA) assembled with NeXtVLAD is designed to fully extract speaker information from the frame-level features in different hidden layers of the DCNN. (3) A mutually complementary assembled loss function, consisting of a prototypical loss and a margin-based softmax loss, is proposed to train the model. Extensive experiments have been conducted on the VoxCeleb-1 dataset, and the results show that our proposed system obtains significant performance improvements over the baseline and achieves new state-of-the-art results. The source code of this paper is available at https://github.com/LCF2764/Discriminative-Speaker-Embedding.
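To illustrate the aggregation idea described in point (1), below is a minimal NumPy sketch of NeXtVLAD-style grouped aggregation. It is an assumption-laden simplification, not the paper's implementation (which uses trained weights inside a DCNN): the weight matrices `W_assign` and `W_gate`, and the function name `nextvlad`, are hypothetical, and the feature-expansion step is assumed to have already been applied to the input `X`.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def nextvlad(X, centers, W_assign, W_gate, groups):
    """Illustrative NeXtVLAD-style aggregation (simplified sketch).

    X        : (T, D_exp) expanded frame-level features
    centers  : (K, D_g)   cluster centers, with D_g = D_exp // groups
    W_assign : (D_exp, groups*K) soft-assignment projection (hypothetical)
    W_gate   : (D_exp, groups)   per-group attention gate (hypothetical)
    Returns an L2-normalized utterance-level vector of size K * D_g.
    """
    T, D_exp = X.shape
    K, D_g = centers.shape
    assert D_exp == groups * D_g

    # Decompose each high-dimensional frame feature into `groups`
    # low-dimensional vectors before VLAD aggregation
    Xg = X.reshape(T, groups, D_g)                              # (T, G, D_g)
    # Soft assignment of each (frame, group) vector to K clusters
    alpha = softmax((X @ W_assign).reshape(T, groups, K))       # (T, G, K)
    # Sigmoid attention gate weighting each group's contribution
    gate = 1.0 / (1.0 + np.exp(-(X @ W_gate)))                  # (T, G)
    a = alpha * gate[..., None]                                 # (T, G, K)

    # VLAD: weighted sum of residuals to each cluster center,
    # v[k] = sum_{t,g} a[t,g,k] * (Xg[t,g] - centers[k])
    v = np.einsum('tgk,tgd->kd', a, Xg) \
        - a.sum(axis=(0, 1))[:, None] * centers                 # (K, D_g)
    v /= np.linalg.norm(v, axis=1, keepdims=True) + 1e-12       # intra-norm
    v = v.reshape(-1)
    return v / (np.linalg.norm(v) + 1e-12)                      # final L2 norm

# Usage with random stand-in features: 50 frames, 4 groups, 8 clusters
rng = np.random.default_rng(0)
T, G, K, D_g = 50, 4, 8, 8
D_exp = G * D_g
emb = nextvlad(rng.normal(size=(T, D_exp)),
               rng.normal(size=(K, D_g)),
               rng.normal(size=(D_exp, G * K)),
               rng.normal(size=(D_exp, G)),
               groups=G)
# emb is a unit-norm vector of dimension K * D_g = 64
```

The grouping is the key saving over plain NetVLAD: clustering is done on G low-dimensional vectors of size D_exp/G rather than on the full D_exp-dimensional feature, so the output (and parameter count) scales with K * D_exp/G instead of K * D_exp.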