Progressive channel fusion for more efficient TDNN on speaker verification

Impact factor: 3.0 · CAS Region 3 (Computer Science) · JCR Q2 (Acoustics)

Speech Communication · Pub Date: 2024-07-23 · DOI: 10.1016/j.specom.2024.103105

Zhenduo Zhao, Zhuo Li, Wenchao Wang, Ji Xu

Speech Communication, volume 163, Article 103105 (journal article, not open access). Full text: https://www.sciencedirect.com/science/article/pii/S0167639324000773

Citations: 0

Abstract

ECAPA-TDNN is one of the most popular TDNNs for speaker verification. While most follow-up work focuses on carefully designed auxiliary modules, the depth-first principle has recently shown promising performance. However, experiments show that TDNNs based on one-dimensional convolution (Conv1D) suffer performance degradation when a large number of vanilla basic blocks are simply stacked. Because Conv1D naturally has a global receptive field (RF) along the feature dimension, progressive channel fusion (PCF) is proposed to alleviate this issue: group convolution is introduced to build local RFs, and the subbands are fused progressively. Instead of reducing the number of groups in the convolution layers, as previous work did, a novel channel permutation strategy is introduced to build information flow between groups, so that all basic blocks in the model keep consistent parameter efficiency. The information leakage from lower-frequency bands to higher ones caused by Res2Block is addressed at the same time by introducing group-in-group convolution together with channel permutation. Besides the PCF strategy, redundant connections are removed for a more concise model architecture. Experiments on VoxCeleb and CnCeleb achieve state-of-the-art (SOTA) performance, with an average relative improvement of 12.3% on EER and 13.2% on minDCF (0.01), validating the effectiveness of the proposed model.
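The two ingredients named in the abstract, group convolution and channel permutation, can be illustrated with a small pure-Python sketch. It is not the authors' implementation: the abstract does not specify the exact permutation, so a ShuffleNet-style interleave is assumed here, and the helper names `grouped_conv1d_params` and `channel_shuffle` as well as the layer sizes are hypothetical. The first helper shows why keeping the group count fixed keeps the parameter cost per block constant; the second shows how a permutation lets information cross group boundaries between grouped layers.

```python
def grouped_conv1d_params(c_in, c_out, kernel_size, groups=1, bias=False):
    """Number of weights in a Conv1D layer whose channels are split into groups.

    Each output channel only sees c_in / groups input channels, so the
    weight count shrinks by a factor of `groups` relative to a dense layer.
    """
    if c_in % groups or c_out % groups:
        raise ValueError("channel counts must be divisible by groups")
    weights = (c_in // groups) * c_out * kernel_size
    return weights + (c_out if bias else 0)


def channel_shuffle(channels, groups):
    """Interleave channels so new groups mix channels from different old groups.

    Assumed ShuffleNet-style permutation, used purely for illustration:
    view the list as [groups][per_group], transpose, and flatten.
    """
    n = len(channels)
    if n % groups:
        raise ValueError("number of channels must be divisible by groups")
    per_group = n // groups
    return [channels[g * per_group + i]
            for i in range(per_group)
            for g in range(groups)]


if __name__ == "__main__":
    # Hypothetical layer: 512 -> 512 channels, kernel size 3.
    print(grouped_conv1d_params(512, 512, 3))            # dense layer
    print(grouped_conv1d_params(512, 512, 3, groups=8))  # 8 groups (subbands)

    # Two groups of four channels: [0..3] and [4..7].
    print(channel_shuffle(list(range(8)), groups=2))
```

With two groups of four channels, the shuffled order places channels from both original groups into each new group of four, so the next grouped layer sees a mix of subbands without any increase in weight count.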


Source journal

Speech Communication (Engineering & Technology - Computer Science: Interdisciplinary Applications)

CiteScore: 6.80
Self-citation rate: 6.20%
Review time: 19.2 weeks

About the journal: Speech Communication is an interdisciplinary journal whose primary objective is to fulfil the need for the rapid dissemination and thorough discussion of basic and applied research results. The journal's primary objectives are:
• to present a forum for the advancement of human and human-machine speech communication science;
• to stimulate cross-fertilization between different fields of this domain;
• to contribute towards the rapid and wide diffusion of scientifically sound contributions in this domain.