Shichao Hu, B. Liang, Zhouxuan Chen, Xiao Lu, Ethan Zhao, Simon Lui
2021 International Joint Conference on Neural Networks (IJCNN), published 2021-07-18
DOI: 10.1109/IJCNN52387.2021.9533911 (https://doi.org/10.1109/IJCNN52387.2021.9533911)
Large-scale singer recognition using deep metric learning: an experimental study
Singer recognition aims to automatically recognize the singer of a given recording. Compared to the spoken voice, the singing voice is characterized by a much higher degree of vocal style. The task becomes more challenging when the number of singers is large. This paper explores different strategies in a deep metric learning framework, with special focus on their performance on a large-scale dataset consisting of audio samples from 5057 singers. We conduct thorough experiments to compare loss functions, including triplet loss, generalized end-to-end (GE2E) loss, and prototypical network (PN) loss. The effect of vocal source separation is also investigated. Using audio inputs with separated vocals, our model trained with PN loss outperforms the other evaluated methods in the identification task. In the verification task, with one-on-one comparison of two single embeddings, triplet loss achieves the best results. However, when the centroid of 5 embeddings is used to represent the singer embedding, verification with PN loss outperforms the methods trained with triplet loss. Using longer segments for a singer representation consistently improves performance on all evaluated tasks.
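The centroid-based verification mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state the similarity measure or the decision rule, so cosine similarity, the threshold value, and the function names below are assumptions made for illustration only, not the authors' implementation. The toy 2-dimensional vectors stand in for real network embeddings.

```python
import math


def centroid(embeddings):
    """Element-wise mean of a list of equal-length embedding vectors."""
    n = len(embeddings)
    return [sum(column) / n for column in zip(*embeddings)]


def cosine_similarity(a, b):
    """Cosine similarity between two vectors (assumed metric, not from the paper)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def verify(enrolled_embeddings, query_embedding, threshold=0.5):
    """Accept the query if it is close enough to the singer's centroid.

    The singer is represented by the centroid of several embeddings
    (5 in the abstract); the threshold is a hypothetical value.
    """
    singer_embedding = centroid(enrolled_embeddings)
    return cosine_similarity(singer_embedding, query_embedding) >= threshold


# Toy example: five enrolled segment embeddings for one singer.
enrolled = [[1.0, 0.0], [0.8, 0.2], [0.9, 0.1], [1.0, 0.1], [0.8, 0.1]]
same_singer_query = [0.9, 0.1]
other_singer_query = [0.0, 1.0]

print(verify(enrolled, same_singer_query))
print(verify(enrolled, other_singer_query))
```

Averaging several embeddings reduces the influence of any single atypical segment, which is one plausible reading of why a centroid representation helps in the verification setting described above.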