Xi Xuan, Rong Jin, Tingyu Xuan, Guolei Du, Kaisheng Xuan
{"title":"基于改进ECAPA-TDNN的多场景鲁棒说话人验证系统","authors":"Xi Xuan, Rong Jin, Tingyu Xuan, Guolei Du, Kaisheng Xuan","doi":"10.1109/IAEAC54830.2022.9929964","DOIUrl":null,"url":null,"abstract":"In order to solve the problems of cross-domain, short speech, and noise interference in industrial application scenarios of speaker recognition, this paper proposes an improved ECAPA-TDNN for a multi-scene robust speaker verification system architecture-improved DD-ECAP A-TDNN.The design of the DD-ECAPA-TDNN architecture is inspired by the model ECAPA-TDNN, which has recently become popular in ASV systems. Firstly, we use FBanks to extract acoustic features, followed by the DD-SE-Res2Net Block proposed in this paper to capture local features efficiently. Finally, the output feature mapping of all DD-SE-Res2Net Blocks aggregated at multiple scales, and finally the ASP pooling operation is performed. The experiments were based on the VoxCeleb1-dev dataset, and SC-AAMSoftmax was used to train a speaker identification model for 1211 speakers. This DD-ECAPA-TDNN model was used as speaker embedding extractor to construct an automatic speaker verification (ASV) system. We used VoxMovies and VoxCeleb1-O evaluation sets to simulate three scenarios of cross-domain, short speech and noise interference, respectively, to evaluate the performance of the DD-ECAPA-TDNN system under multiple scenarios. The system achieves an EER of 2.51% on VoxCeleb1-O. The DD-ECAPA-TDNN system significantly outperforms the ECAPA-TDNN system in terms of recognition performance in multiple scenarios. Finally, our ablation experiments show that the DD-SE-Res2N et Block has a positive impact on the performance of the ASV system, as well as that the DD-ECAPA-TDNN can extract robust and accurate speaker embedding with good scene generalization.","PeriodicalId":349113,"journal":{"name":"2022 IEEE 6th Advanced Information Technology, Electronic and Automation Control Conference (IAEAC )","volume":"12 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-10-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Multi-Scene Robust Speaker Verification System Built on Improved ECAPA-TDNN\",\"authors\":\"Xi Xuan, Rong Jin, Tingyu Xuan, Guolei Du, Kaisheng Xuan\",\"doi\":\"10.1109/IAEAC54830.2022.9929964\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In order to solve the problems of cross-domain, short speech, and noise interference in industrial application scenarios of speaker recognition, this paper proposes an improved ECAPA-TDNN for a multi-scene robust speaker verification system architecture-improved DD-ECAP A-TDNN.The design of the DD-ECAPA-TDNN architecture is inspired by the model ECAPA-TDNN, which has recently become popular in ASV systems. Firstly, we use FBanks to extract acoustic features, followed by the DD-SE-Res2Net Block proposed in this paper to capture local features efficiently. Finally, the output feature mapping of all DD-SE-Res2Net Blocks aggregated at multiple scales, and finally the ASP pooling operation is performed. The experiments were based on the VoxCeleb1-dev dataset, and SC-AAMSoftmax was used to train a speaker identification model for 1211 speakers. This DD-ECAPA-TDNN model was used as speaker embedding extractor to construct an automatic speaker verification (ASV) system. 
We used VoxMovies and VoxCeleb1-O evaluation sets to simulate three scenarios of cross-domain, short speech and noise interference, respectively, to evaluate the performance of the DD-ECAPA-TDNN system under multiple scenarios. The system achieves an EER of 2.51% on VoxCeleb1-O. The DD-ECAPA-TDNN system significantly outperforms the ECAPA-TDNN system in terms of recognition performance in multiple scenarios. Finally, our ablation experiments show that the DD-SE-Res2N et Block has a positive impact on the performance of the ASV system, as well as that the DD-ECAPA-TDNN can extract robust and accurate speaker embedding with good scene generalization.\",\"PeriodicalId\":349113,\"journal\":{\"name\":\"2022 IEEE 6th Advanced Information Technology, Electronic and Automation Control Conference (IAEAC )\",\"volume\":\"12 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-10-03\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2022 IEEE 6th Advanced Information Technology, Electronic and Automation Control Conference (IAEAC )\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/IAEAC54830.2022.9929964\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 IEEE 6th Advanced Information Technology, Electronic and Automation Control Conference (IAEAC )","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/IAEAC54830.2022.9929964","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Multi-Scene Robust Speaker Verification System Built on Improved ECAPA-TDNN
Abstract

In order to solve the problems of cross-domain speech, short speech, and noise interference in industrial speaker-recognition scenarios, this paper proposes an improved ECAPA-TDNN architecture, DD-ECAPA-TDNN, for a multi-scene robust speaker verification system. The design of DD-ECAPA-TDNN is inspired by ECAPA-TDNN, a model that has recently become popular in ASV systems. First, FBank acoustic features are extracted and passed through the DD-SE-Res2Net Blocks proposed in this paper, which capture local features efficiently. The output feature maps of all DD-SE-Res2Net Blocks are then aggregated at multiple scales, and finally attentive statistics pooling (ASP) is applied. Experiments were based on the VoxCeleb1-dev dataset, and SC-AAMSoftmax was used to train a speaker identification model for 1211 speakers. The resulting DD-ECAPA-TDNN model was used as a speaker embedding extractor to build an automatic speaker verification (ASV) system. We used the VoxMovies and VoxCeleb1-O evaluation sets to simulate three scenarios (cross-domain speech, short speech, and noise interference) and to evaluate the performance of the DD-ECAPA-TDNN system under each of them. The system achieves an EER of 2.51% on VoxCeleb1-O and significantly outperforms the ECAPA-TDNN system in recognition performance across these scenarios. Finally, ablation experiments show that the DD-SE-Res2Net Block has a positive impact on ASV performance and that DD-ECAPA-TDNN extracts robust and accurate speaker embeddings with good scene generalization.
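The FBank front end named in the abstract is the standard log mel filterbank feature extractor used by most ASV systems. Below is a minimal sketch using torchaudio's Kaldi-compatible implementation; the 80 mel bins, 25 ms window, 10 ms shift, and the file name "utterance.wav" are illustrative assumptions rather than settings confirmed by the paper.

```python
# Hedged sketch: log mel filterbank (FBank) feature extraction with torchaudio.
# The bin count, window length, and hop size below are common ASV defaults,
# not the paper's confirmed configuration; "utterance.wav" is a placeholder.
import torchaudio

waveform, sample_rate = torchaudio.load("utterance.wav")
fbank = torchaudio.compliance.kaldi.fbank(
    waveform,
    num_mel_bins=80,           # number of mel filterbank channels
    frame_length=25.0,         # window length in milliseconds
    frame_shift=10.0,          # hop length in milliseconds
    sample_frequency=sample_rate,
)
print(fbank.shape)             # (num_frames, 80)
```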
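The ASP step refers to attentive statistics pooling, the mechanism ECAPA-TDNN-style systems use to turn variable-length frame-level features into a fixed-length utterance vector. The sketch below shows the generic form (a learned attention producing a weighted mean and standard deviation); it omits the channel- and context-dependent attention of the full ECAPA-TDNN variant, and the layer sizes are assumptions.

```python
# Hedged sketch of attentive statistics pooling (ASP): frame-level features
# are weighted by a learned attention and reduced to a weighted mean and
# standard deviation concatenated into one vector.
import torch
import torch.nn as nn


class AttentiveStatsPooling(nn.Module):
    def __init__(self, channels: int, attention_channels: int = 128):
        super().__init__()
        # Small attention network that scores every time frame per channel.
        self.attention = nn.Sequential(
            nn.Conv1d(channels, attention_channels, kernel_size=1),
            nn.Tanh(),
            nn.Conv1d(attention_channels, channels, kernel_size=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames)
        alpha = torch.softmax(self.attention(x), dim=2)       # per-frame weights
        mean = torch.sum(alpha * x, dim=2)                     # weighted mean
        var = torch.sum(alpha * x * x, dim=2) - mean * mean    # weighted variance
        std = torch.sqrt(var.clamp(min=1e-9))                  # weighted std
        return torch.cat([mean, std], dim=1)                   # (batch, 2 * channels)


# Example: pool 1536-channel frame-level features into a 3072-dim utterance vector.
pooling = AttentiveStatsPooling(channels=1536)
frames = torch.randn(4, 1536, 200)
print(pooling(frames).shape)   # torch.Size([4, 3072])
```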
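The training objective named in the abstract, SC-AAMSoftmax, is the sub-center variant of the additive angular margin softmax. Only the plain single-center AAM-Softmax is sketched below (the sub-center extension keeps several weight vectors per speaker and is not shown); the margin and scale values are assumptions, not the paper's hyperparameters.

```python
# Hedged sketch of an AAM-Softmax (additive angular margin) training loss.
# Margin and scale are common defaults, not values confirmed by the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AAMSoftmaxLoss(nn.Module):
    def __init__(self, embed_dim: int, n_speakers: int,
                 margin: float = 0.2, scale: float = 30.0):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(n_speakers, embed_dim))
        nn.init.xavier_uniform_(self.weight)
        self.margin = margin
        self.scale = scale

    def forward(self, embeddings: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # Cosine similarity between L2-normalised embeddings and class weights.
        cosine = F.linear(F.normalize(embeddings), F.normalize(self.weight))
        theta = torch.acos(cosine.clamp(-1.0 + 1e-7, 1.0 - 1e-7))
        # Add the angular margin only to the target-class logit.
        target_logit = torch.cos(theta + self.margin)
        one_hot = F.one_hot(labels, cosine.size(1)).to(cosine.dtype)
        logits = self.scale * (one_hot * target_logit + (1.0 - one_hot) * cosine)
        return F.cross_entropy(logits, labels)
```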
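Finally, the equal error rate (EER) behind the 2.51% figure on VoxCeleb1-O is computed from trial scores and same/different-speaker labels. The sketch below scores trials with cosine similarity, a common but here assumed back end for ECAPA-TDNN-style embeddings, and finds the operating point where false-accept and false-reject rates cross.

```python
# Hedged sketch of cosine scoring and EER computation for ASV trials.
# The trial scores at the bottom are made-up toy values for illustration.
import numpy as np
from sklearn.metrics import roc_curve


def cosine_score(emb1: np.ndarray, emb2: np.ndarray) -> float:
    """Cosine similarity between two speaker embeddings."""
    return float(np.dot(emb1, emb2) / (np.linalg.norm(emb1) * np.linalg.norm(emb2)))


def compute_eer(scores: np.ndarray, labels: np.ndarray) -> float:
    """labels: 1 for same-speaker trials, 0 for different-speaker trials."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))   # point where FAR and FRR cross
    return float((fpr[idx] + fnr[idx]) / 2.0)


# Toy example with made-up trial scores.
scores = np.array([0.82, 0.75, 0.64, 0.31, 0.22, 0.18])
labels = np.array([1, 1, 1, 0, 0, 0])
print(f"EER: {compute_eer(scores, labels):.2%}")
```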