Attentive deep CNN for speaker verification
Yong-bin Yu, Min-hui Qi, Yi-fan Tang, Quan-xin Deng, Chenhui Peng, Feng Mai, T. Nyima
International Conference on Signal Processing Systems, 20 January 2021. DOI: 10.1117/12.2581351

Abstract:
In this paper, an end-to-end speaker verification system based on an attentive deep convolutional neural network (CNN) is presented. The system takes log filter-bank coefficients as input and verifies a speaker by measuring the cosine similarity between a test utterance and the enrollment utterances. The approach applies the channel-attention module of the convolutional block attention module (CBAM) to increase representation power by assigning different weights to the feature maps. In addition, softmax pre-training is used to initialize the network weights, and the tuple-based end-to-end (TE2E) loss function is used for fine-tuning toward the evaluation stage; this strategy not only yields notable improvements over the baseline model but also allows direct optimization of the evaluation metric. Experimental results on the VoxCeleb dataset indicate that the proposed model achieves an equal error rate (EER) of 3.83%, slightly worse than x-vectors but better than i-vectors.
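The abstract does not include code, but the two mechanisms it names are well known. Below is a minimal PyTorch sketch, under my own assumptions, of (a) a CBAM-style channel-attention block that re-weights CNN feature maps and (b) cosine-similarity scoring against an enrollment centroid with a TE2E-style loss. The names `ChannelAttention`, `te2e_score`, `te2e_loss` and the reduction ratio are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ChannelAttention(nn.Module):
    """CBAM-style channel attention: pool away the spatial dims, score each channel."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # Shared MLP applied to both the average- and max-pooled descriptors.
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, freq, time) feature maps from the CNN trunk.
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))        # (b, c) average-pooled descriptor
        mx = self.mlp(x.amax(dim=(2, 3)))         # (b, c) max-pooled descriptor
        weights = torch.sigmoid(avg + mx).view(b, c, 1, 1)
        return x * weights                        # channel-re-weighted feature maps


def te2e_score(test_emb: torch.Tensor, enroll_embs: torch.Tensor,
               w: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Scaled cosine similarity between a test embedding and the enrollment centroid."""
    centroid = F.normalize(enroll_embs, dim=-1).mean(dim=0)
    cosine = F.cosine_similarity(test_emb, centroid, dim=-1)
    return w * cosine + b                         # learned scale/offset, as in TE2E


def te2e_loss(score: torch.Tensor, same_speaker: torch.Tensor) -> torch.Tensor:
    """Binary cross-entropy on the tuple score (float target: 1.0 same speaker, 0.0 otherwise)."""
    return F.binary_cross_entropy_with_logits(score, same_speaker)
```

At verification time the scaled cosine score is compared to a threshold; the 3.83% EER reported above corresponds to the operating point at which the false-accept and false-reject rates are equal.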