External-Attentive Statistics Pooling for Text-Independent Speaker Verification
Authors: Lidong Pan, Chunhao He, Tieyuan Chang
Venue: 2023 IEEE 3rd International Conference on Computer Communication and Artificial Intelligence (CCAI)
Published: 2023-05-26
DOI: https://doi.org/10.1109/CCAI57533.2023.10201326
Speaker verification is an important biometric identification technique. In neural-network-based speaker feature extraction models, the pooling layer plays a key role: it aggregates frame-level features into utterance-level features. Different pooling methods aggregate frame-level features differently, which in turn affects the representational power of the final speaker features. In existing work, some pooling methods with attention mechanisms have shown stronger feature aggregation capability than traditional pooling methods. In this paper, we combine low-complexity External Attention with statistics pooling to design External-Attentive Statistics Pooling and, motivated by the biological properties of human hearing, propose Multi-Group External-Attentive Statistics Pooling. Both methods are applied to text-independent speaker verification and evaluated on the VoxCeleb1 test set, VoxCeleb1-H, and VoxCeleb1-E. The results show that the proposed methods achieve more effective feature aggregation without significantly increasing the number of model parameters.
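To make the aggregation step concrete, the following is a minimal sketch of statistics pooling and an attention-weighted variant. It is not the paper's implementation: the abstract does not give the exact formulation, so `external_attention_weights`, the `memory` argument, and the rule for turning memory scores into frame weights (softmax over time per memory slot, then averaging across slots) are assumptions made purely for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_statistics_pooling(frames, weights):
    """Concatenate the weighted mean and weighted std of each feature dimension.

    frames:  T frame-level feature vectors, each of length D.
    weights: T non-negative frame weights that sum to 1.
    Returns one utterance-level vector of length 2 * D.
    """
    dim = len(frames[0])
    mean = [sum(w * f[d] for w, f in zip(weights, frames)) for d in range(dim)]
    var = [
        sum(w * (f[d] - mean[d]) ** 2 for w, f in zip(weights, frames))
        for d in range(dim)
    ]
    std = [math.sqrt(max(v, 0.0)) for v in var]
    return mean + std


def statistics_pooling(frames):
    """Plain statistics pooling: every frame gets the same weight."""
    uniform = [1.0 / len(frames)] * len(frames)
    return weighted_statistics_pooling(frames, uniform)


def external_attention_weights(frames, memory):
    """Hypothetical frame weighting driven by a small external memory.

    Assumption (not from the paper): each memory slot scores every frame by
    a dot product, the scores are softmax-normalised over time, and the
    per-slot weights are averaged. The memory is shared across utterances,
    so the cost grows linearly with the number of frames.
    """
    per_slot = [
        softmax([sum(a * b for a, b in zip(frame, slot)) for frame in frames])
        for slot in memory
    ]
    return [sum(column) / len(memory) for column in zip(*per_slot)]


def external_attentive_statistics_pooling(frames, memory):
    """Statistics pooling with frame weights taken from the external memory."""
    weights = external_attention_weights(frames, memory)
    return weighted_statistics_pooling(frames, weights)


if __name__ == "__main__":
    frames = [[1.0, 2.0], [3.0, 4.0]]  # T = 2 frames, D = 2 features
    memory = [[0.5, -0.5], [1.0, 0.0]]  # 2 hypothetical memory slots
    print(statistics_pooling(frames))
    print(external_attentive_statistics_pooling(frames, memory))
```

In a trained model the memory would be learned together with the rest of the network; here it is a fixed list so the sketch stays self-contained. A multi-group variant could, under the same assumptions, split the feature dimensions into groups and apply a separate memory to each group before concatenating the results.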