Multi-Scene Robust Speaker Verification System Built on Improved ECAPA-TDNN
Xi Xuan, Rong Jin, Tingyu Xuan, Guolei Du, Kaisheng Xuan
2022 IEEE 6th Advanced Information Technology, Electronic and Automation Control Conference (IAEAC), 2022-10-03. DOI: 10.1109/IAEAC54830.2022.9929964
Abstract
To address cross-domain speech, short utterances, and noise interference in industrial speaker recognition scenarios, this paper proposes DD-ECAPA-TDNN, an improved ECAPA-TDNN architecture for a multi-scene robust speaker verification system. The design of DD-ECAPA-TDNN is inspired by ECAPA-TDNN, which has recently become popular in automatic speaker verification (ASV) systems. First, FBank acoustic features are extracted and passed through the DD-SE-Res2Net Blocks proposed in this paper to capture local features efficiently. The output feature maps of all DD-SE-Res2Net Blocks are then aggregated at multiple scales, and attentive statistics pooling (ASP) is applied. Experiments were conducted on the VoxCeleb1-dev dataset, and SC-AAMSoftmax was used to train a speaker identification model for 1211 speakers. The trained DD-ECAPA-TDNN model was then used as the speaker embedding extractor in an ASV system. The VoxMovies and VoxCeleb1-O evaluation sets were used to simulate the cross-domain, short-speech, and noise-interference scenarios, respectively, and to evaluate the performance of the DD-ECAPA-TDNN system across them. The system achieves an equal error rate (EER) of 2.51% on VoxCeleb1-O and significantly outperforms the baseline ECAPA-TDNN system in recognition performance across the tested scenarios. Finally, ablation experiments show that the DD-SE-Res2Net Block has a positive impact on ASV performance and that DD-ECAPA-TDNN extracts robust and accurate speaker embeddings with good scene generalization.
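The abstract names attentive statistics pooling (ASP) as the final step that turns frame-level features into an utterance-level speaker embedding input. The paper's exact DD-SE-Res2Net blocks and channel sizes are not given here, so the following is only a minimal PyTorch sketch of a standard ASP layer as used in ECAPA-TDNN-style systems; the class name, the 128-unit attention bottleneck, and the 512-channel example are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn


class AttentiveStatsPooling(nn.Module):
    """Attentive statistics pooling (ASP): frame-level features are reduced to a
    single utterance-level vector by concatenating an attention-weighted mean
    and an attention-weighted standard deviation over the time axis."""

    def __init__(self, channels: int, bottleneck: int = 128):
        super().__init__()
        # Small attention network that scores each frame per channel.
        self.attention = nn.Sequential(
            nn.Conv1d(channels, bottleneck, kernel_size=1),
            nn.Tanh(),
            nn.Conv1d(bottleneck, channels, kernel_size=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames)
        alpha = torch.softmax(self.attention(x), dim=2)    # attention weights over frames
        mean = torch.sum(alpha * x, dim=2)                 # weighted mean
        var = torch.sum(alpha * x * x, dim=2) - mean ** 2  # weighted variance
        std = torch.sqrt(var.clamp(min=1e-8))              # weighted standard deviation
        return torch.cat([mean, std], dim=1)               # (batch, 2 * channels)


if __name__ == "__main__":
    pooling = AttentiveStatsPooling(channels=512)
    frames = torch.randn(8, 512, 200)   # e.g. 8 utterances, 512-channel features, 200 frames
    pooled = pooling(frames)
    print(pooled.shape)                 # torch.Size([8, 1024])
```

In an ECAPA-TDNN-style pipeline, the pooled vector would typically be projected by a linear layer into the fixed-dimensional speaker embedding that the verification back-end compares.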