Joint Optimization of Classification and Clustering for Deep Speaker Embedding
Authors: Zhiming Wang, K. Yao, Shuo Fang, Xiaolong Li
Venue: 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)
Publication date: 2019-12-01
DOI: 10.1109/ASRU46091.2019.9003860 (https://doi.org/10.1109/ASRU46091.2019.9003860)
Citations: 5
Abstract
This paper proposes a method for training deep speaker embeddings end-to-end that jointly optimizes classification and clustering. A large margin softmax loss is used to reduce classification errors. A novel large margin Gaussian mixture loss is proposed to improve clustering. With the joint optimization, the learned embeddings capture segment-level acoustic representations from variable-length speech segments, both to discriminate between speakers and to replicate the densities of speaker clusters. We compare performance with alternative methods on the large-scale text-independent speaker recognition dataset VoxCeleb1 [1] and observe that the proposed method outperforms them significantly, achieving new state-of-the-art results on the dataset. Moreover, because of the joint optimization, the method converges faster and better than training with the classification loss alone. Our results suggest great potential for joint optimization of classification and clustering in speaker verification and identification.
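The abstract does not give the loss formulas, so the following is only a minimal sketch of what a joint objective of this kind could look like, not the authors' implementation. It assumes an additive-margin softmax on cosine similarities for the classification term and an identity-covariance Gaussian per speaker for the clustering term. The function names, the margin and scale values, and the weights `lam` and `alpha` are all hypothetical choices made for illustration.

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over the speaker axis.
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def margin_softmax_loss(emb, weights, labels, margin=0.2, scale=30.0):
    # Assumed variant: additive margin subtracted from the true-class cosine.
    emb_n = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    w_n = weights / np.linalg.norm(weights, axis=1, keepdims=True)
    cos = emb_n @ w_n.T                       # (batch, speakers)
    onehot = np.eye(weights.shape[0])[labels]
    logits = scale * (cos - margin * onehot)
    return -(log_softmax(logits) * onehot).sum(axis=1).mean()

def margin_gaussian_loss(emb, means, labels, margin=0.3, lam=0.1):
    # Assumed variant: one identity-covariance Gaussian per speaker.
    # Half squared distance from each embedding to each speaker mean.
    dist = 0.5 * ((emb[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    onehot = np.eye(means.shape[0])[labels]
    # Enlarge the true-class distance by (1 + margin) before the softmax.
    logits = -dist * (1.0 + margin * onehot)
    classification = -(log_softmax(logits) * onehot).sum(axis=1).mean()
    # Likelihood term pulls each embedding toward its own speaker mean.
    likelihood = (dist * onehot).sum(axis=1).mean()
    return classification + lam * likelihood

def joint_loss(emb, weights, means, labels, alpha=1.0):
    # Weighted sum of the classification and clustering terms.
    return (margin_softmax_loss(emb, weights, labels)
            + alpha * margin_gaussian_loss(emb, means, labels))
```

In this sketch both terms are computed on the same segment-level embeddings, so a single backward pass would update the network for classification and clustering together. In a real system the embeddings would come from a neural network and the losses would be written in an autodiff framework; NumPy is used here only to keep the arithmetic visible.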