Speaker-IPL: Unsupervised Learning of Speaker Characteristics with i-Vector based Pseudo-Labels

Zakaria Aldeneh, Takuya Higuchi, Jee-weon Jung, Li-Wei Chen, Stephen Shum, Ahmed Hussen Abdelaziz, Shinji Watanabe, Tatiana Likhomanenko, Barry-John Theobald

arXiv:2409.10791 · arXiv - EE - Audio and Speech Processing · 2024-09-16

Iterative self-training, or iterative pseudo-labeling (IPL), in which an improved model from the current iteration provides pseudo-labels for the next iteration, has proven to be a powerful approach for enhancing the quality of
speaker representations. Recent applications of IPL in unsupervised speaker
recognition start with representations extracted from very elaborate
self-supervised methods (e.g., DINO). However, training such strong
self-supervised models is not straightforward (they require hyper-parameter tuning and may not generalize to out-of-domain data) and, moreover, may not be
needed at all. To this end, we show that the simple, well-studied, and established i-vector generative model is enough to bootstrap the IPL process for
unsupervised learning of speaker representations. We also systematically study
the impact of other components on the IPL process, including the initial
model, the encoder, augmentations, the number of clusters, and the clustering
algorithm. Remarkably, we find that even with a simple and significantly weaker
initial model like i-vector, IPL can still achieve speaker verification
performance that rivals state-of-the-art methods.
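
To make the loop concrete, below is a minimal sketch of one plausible IPL recipe as the abstract describes it: extract embeddings from the current model (i-vectors at iteration zero), cluster them into pseudo-speaker labels, train a new encoder against those labels, then re-extract embeddings and repeat. The names `extract_embeddings` and `train_encoder` are hypothetical placeholders, not the paper's code, and the toy "encoder" here is purely illustrative; clustering uses scikit-learn k-means as one common choice, while the paper treats the clustering algorithm and the number of clusters as variables under study.

```python
"""Illustrative iterative pseudo-labeling (IPL) loop for speaker
representations. All model components are stand-ins, not the paper's
implementation."""
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)


def extract_embeddings(encoder, features):
    """Placeholder extractor: at iteration 0 the raw features play the
    role of i-vectors; afterwards, the trained encoder is used."""
    if encoder is None:
        return features
    return encoder(features)


def train_encoder(features, pseudo_labels):
    """Placeholder for supervised training on pseudo-labels. Here the
    'encoder' just pulls each utterance toward its nearest pseudo-speaker
    centroid, standing in for a real classifier-trained network."""
    centroids = np.stack(
        [features[pseudo_labels == k].mean(axis=0)
         for k in np.unique(pseudo_labels)]
    )

    def encoder(x):
        # distance from every utterance to every centroid: (N, K)
        dists = ((x[:, None, :] - centroids[None]) ** 2).sum(-1)
        nearest = centroids[np.argmin(dists, axis=1)]
        return 0.5 * x + 0.5 * nearest

    return encoder


features = rng.normal(size=(1000, 64))  # stand-in utterance features
num_clusters = 10                       # a hyper-parameter the paper studies
encoder = None                          # no trained encoder yet

for iteration in range(3):  # the IPL loop
    embeddings = extract_embeddings(encoder, features)
    pseudo_labels = KMeans(
        n_clusters=num_clusters, n_init=10, random_state=0
    ).fit_predict(embeddings)
    encoder = train_encoder(embeddings, pseudo_labels)
    print(f"iteration {iteration}: assigned {num_clusters} pseudo-speakers")
```

The key property of the loop, and the paper's central claim, is that the initial model only needs to be good enough to seed the first round of clustering; subsequent iterations refine the labels, which is why a weak but cheap i-vector model can replace an elaborate self-supervised one at iteration zero.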