{"title":"利用广播式 CNN 变换器和自我注意力地图的知识蒸馏训练实现高效轻量级扬声器验证","authors":"Jeong-Hwan Choi;Joon-Young Yang;Joon-Hyuk Chang","doi":"10.1109/TASLP.2024.3463491","DOIUrl":null,"url":null,"abstract":"Developing a lightweight speaker embedding extractor (SEE) is crucial for the practical implementation of automatic speaker verification (ASV) systems. To this end, we recently introduced \n<italic>broadcasting convolutional neural networks (CNNs)-meet-vision-Transformers</i>\n (BC-CMT), a lightweight SEE that utilizes broadcasted residual learning (BRL) within the hybrid CNN-Transformer architecture to maintain a small number of model parameters. We proposed three BC-CMT-based SEE with three different sizes: BC-CMT-Tiny, -Small, and -Base. In this study, we extend our previously proposed BC-CMT by introducing an improved model architecture and a training strategy based on knowledge distillation (KD) using self-attention (SA) maps. First, to reduce the computational costs and latency of the BC-CMT, the two-dimensional (2D) SA operations in the BC-CMT, which calculate the SA maps in the frequency–time dimensions, are simplified to 1D SA operations that consider only temporal importance. Moreover, to enhance the SA capability of the BC-CMT, the group convolution layers in the SA block are adjusted to have smaller number of groups and are combined with the BRL operations. Second, to improve the training effectiveness of the modified BC-CMT-Tiny, the SA maps of a pretrained large BC-CMT-Base are used for the KD to guide those of a smaller BC-CMT-Tiny. Because the attention map sizes of the modified BC-CMT models do not depend on the number of frequency bins or convolution channels, the proposed strategy enables KD between feature maps with different sizes. The experimental results demonstrate that the proposed BC-CMT-Tiny model having 271.44K model parameters achieved 36.8% and 9.3% reduction in floating point operations on 1s signals and equal error rate (EER) on VoxCeleb 1 testset, respectively, compared to the conventional BC-CMT-Tiny. The CPU and GPU running time of the proposed BC-CMT-Tiny ranges of 1 to 10 s signals were 29.07 to 146.32 ms and 36.01 to 206.43 ms, respectively. The proposed KD further reduced the EER by 15.5% with improved attention capability.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"4580-4595"},"PeriodicalIF":4.1000,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Efficient Lightweight Speaker Verification With Broadcasting CNN-Transformer and Knowledge Distillation Training of Self-Attention Maps\",\"authors\":\"Jeong-Hwan Choi;Joon-Young Yang;Joon-Hyuk Chang\",\"doi\":\"10.1109/TASLP.2024.3463491\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Developing a lightweight speaker embedding extractor (SEE) is crucial for the practical implementation of automatic speaker verification (ASV) systems. To this end, we recently introduced \\n<italic>broadcasting convolutional neural networks (CNNs)-meet-vision-Transformers</i>\\n (BC-CMT), a lightweight SEE that utilizes broadcasted residual learning (BRL) within the hybrid CNN-Transformer architecture to maintain a small number of model parameters. We proposed three BC-CMT-based SEE with three different sizes: BC-CMT-Tiny, -Small, and -Base. 
In this study, we extend our previously proposed BC-CMT by introducing an improved model architecture and a training strategy based on knowledge distillation (KD) using self-attention (SA) maps. First, to reduce the computational costs and latency of the BC-CMT, the two-dimensional (2D) SA operations in the BC-CMT, which calculate the SA maps in the frequency–time dimensions, are simplified to 1D SA operations that consider only temporal importance. Moreover, to enhance the SA capability of the BC-CMT, the group convolution layers in the SA block are adjusted to have smaller number of groups and are combined with the BRL operations. Second, to improve the training effectiveness of the modified BC-CMT-Tiny, the SA maps of a pretrained large BC-CMT-Base are used for the KD to guide those of a smaller BC-CMT-Tiny. Because the attention map sizes of the modified BC-CMT models do not depend on the number of frequency bins or convolution channels, the proposed strategy enables KD between feature maps with different sizes. The experimental results demonstrate that the proposed BC-CMT-Tiny model having 271.44K model parameters achieved 36.8% and 9.3% reduction in floating point operations on 1s signals and equal error rate (EER) on VoxCeleb 1 testset, respectively, compared to the conventional BC-CMT-Tiny. The CPU and GPU running time of the proposed BC-CMT-Tiny ranges of 1 to 10 s signals were 29.07 to 146.32 ms and 36.01 to 206.43 ms, respectively. The proposed KD further reduced the EER by 15.5% with improved attention capability.\",\"PeriodicalId\":13332,\"journal\":{\"name\":\"IEEE/ACM Transactions on Audio, Speech, and Language Processing\",\"volume\":\"32 \",\"pages\":\"4580-4595\"},\"PeriodicalIF\":4.1000,\"publicationDate\":\"2024-09-18\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE/ACM Transactions on Audio, Speech, and Language Processing\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10683974/\",\"RegionNum\":2,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"ACOUSTICS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10683974/","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ACOUSTICS","Score":null,"Total":0}
Citations: 0
Abstract
Developing a lightweight speaker embedding extractor (SEE) is crucial for the practical deployment of automatic speaker verification (ASV) systems. To this end, we recently introduced broadcasting convolutional neural networks (CNNs)-meet-vision-Transformers (BC-CMT), a lightweight SEE that utilizes broadcasted residual learning (BRL) within a hybrid CNN-Transformer architecture to keep the number of model parameters small. We proposed BC-CMT-based SEEs in three sizes: BC-CMT-Tiny, -Small, and -Base. In this study, we extend the previously proposed BC-CMT by introducing an improved model architecture and a training strategy based on knowledge distillation (KD) of self-attention (SA) maps. First, to reduce the computational cost and latency of the BC-CMT, its two-dimensional (2D) SA operations, which compute SA maps over the frequency-time dimensions, are simplified to 1D SA operations that consider only temporal importance. Moreover, to enhance the SA capability of the BC-CMT, the group convolution layers in the SA block are adjusted to use a smaller number of groups and are combined with the BRL operations. Second, to improve the training effectiveness of the modified BC-CMT-Tiny, the SA maps of a pretrained BC-CMT-Base are used in KD to guide those of the smaller BC-CMT-Tiny. Because the attention map sizes of the modified BC-CMT models do not depend on the number of frequency bins or convolution channels, the proposed strategy enables KD between feature maps of different sizes. The experimental results demonstrate that, compared to the conventional BC-CMT-Tiny, the proposed BC-CMT-Tiny model with 271.44K parameters achieved a 36.8% reduction in floating-point operations on 1 s signals and a 9.3% reduction in equal error rate (EER) on the VoxCeleb1 test set. The CPU and GPU running times of the proposed BC-CMT-Tiny for signals of 1 to 10 s ranged from 29.07 to 146.32 ms and from 36.01 to 206.43 ms, respectively. The proposed KD further reduced the EER by 15.5% while improving attention capability.
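To make the 1D simplification concrete, the following is a minimal PyTorch-style sketch, not the authors' released code: a temporal SA block that mean-pools the frequency axis away before attention, so the resulting SA map has a (time x time) shape independent of the number of frequency bins and convolution channels, and whose output is broadcast back over frequency in the spirit of BRL. All module and parameter names here are hypothetical.

import torch
import torch.nn as nn

class TemporalSelfAttention(nn.Module):
    """1D self-attention over the time axis only (sketch, not the paper's exact block).

    The frequency axis is pooled away before attention, so the SA map has shape
    (batch, heads, time, time) regardless of the number of frequency bins or
    convolution channels -- the property that allows KD between teacher and
    student models of different widths.
    """
    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        assert channels % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = channels // num_heads
        self.qkv = nn.Conv1d(channels, 3 * channels, kernel_size=1)

    def forward(self, x: torch.Tensor):
        # x: (batch, channels, freq, time)
        b, c, _, t = x.shape
        h = x.mean(dim=2)                      # pool frequency away: (b, c, t)
        q, k, v = self.qkv(h).chunk(3, dim=1)  # each: (b, c, t)
        q = q.reshape(b, self.num_heads, self.head_dim, t)
        k = k.reshape(b, self.num_heads, self.head_dim, t)
        v = v.reshape(b, self.num_heads, self.head_dim, t)
        # SA map over time only: (b, heads, t, t)
        attn = torch.softmax(q.transpose(-2, -1) @ k / self.head_dim ** 0.5, dim=-1)
        out = (v @ attn.transpose(-2, -1)).reshape(b, c, t)
        # broadcast the temporal output back over frequency (BRL-style residual)
        return x + out.unsqueeze(2), attn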
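Because teacher and student then produce SA maps of the same (time x time) shape, a KD objective can match the maps directly even when the two models differ in channel width. Continuing the sketch above with a plain MSE loss, used here only for illustration (the paper's exact distillation objective may differ):

import torch.nn.functional as F

def sa_map_kd_loss(student_attn: torch.Tensor, teacher_attn: torch.Tensor) -> torch.Tensor:
    # Both maps: (batch, heads, time, time); the pretrained teacher is frozen.
    return F.mse_loss(student_attn, teacher_attn.detach())

# Usage: the models differ in width, yet their SA maps align.
teacher = TemporalSelfAttention(channels=256)  # hypothetical stand-in for BC-CMT-Base
student = TemporalSelfAttention(channels=64)   # hypothetical stand-in for BC-CMT-Tiny
x_teacher = torch.randn(2, 256, 40, 100)       # (batch, channels, freq bins, frames)
x_student = torch.randn(2, 64, 40, 100)
_, attn_teacher = teacher(x_teacher)
_, attn_student = student(x_student)
loss = sa_map_kd_loss(attn_student, attn_teacher)  # both maps: (2, 4, 100, 100)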
Journal description:
The IEEE/ACM Transactions on Audio, Speech, and Language Processing covers audio, speech and language processing and the sciences that support them. In audio processing: transducers, room acoustics, active sound control, human audition, analysis/synthesis/coding of music, and consumer audio. In speech processing: areas such as speech analysis, synthesis, coding, speech and speaker recognition, speech production and perception, and speech enhancement. In language processing: speech and text analysis, understanding, generation, dialog management, translation, summarization, question answering and document indexing and retrieval, as well as general language modeling.