Rep-MCA-former: An efficient multi-scale convolution attention encoder for text-independent speaker verification

IF 3.1 · Zone 3 (Computer Science) · JCR Q2 (Computer Science, Artificial Intelligence)

Computer Speech and Language Pub Date : 2023-12-10 DOI:10.1016/j.csl.2023.101600

Xiaohu Liu, Defu Chen, Xianbao Wang, Sheng Xiang, Xuwen Zhou


Citations: 0

Abstract

In speaker verification, the quality of the speaker embedding is an important factor in system performance. Advanced speaker embedding extraction networks aim to capture richer speaker features through multi-branch architectures. Recently, speaker verification systems based on transformer encoders (e.g., MFA-Conformer) have received much attention and achieved satisfactory results, because transformer encoders can efficiently extract the global features of the speaker. However, these approaches commonly suffer from large parameter counts and high computational latency, which make them difficult to deploy on resource-constrained edge terminals. To address this issue, this paper proposes an effective, lightweight transformer model (MCA-former) with multi-scale convolutional self-attention (MCA), which performs multi-scale modeling and channel modeling along the temporal dimension of the input at low computational cost. In addition, for the inference phase, we develop a systematic re-parameterization method that converts the multi-branch network structure into a single-path topology, effectively improving inference speed. We evaluate the MCA-former for speaker verification on the VoxCeleb1 test set. The results show that the MCA-based transformer model has advantages in parameter count and inference efficiency. With re-parameterization, the inference speed of the model increases by about 30%, and memory consumption is significantly reduced.
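The abstract does not spell out the re-parameterization procedure, so the following is only a minimal sketch of the general idea behind such methods, not the paper's implementation. It assumes parallel 1-D convolution branches with odd kernel sizes whose outputs are summed; because convolution is linear, the branch kernels can be zero-padded to a common width, added together, and replaced by a single convolution at inference time. All function names, kernel sizes, and values below are hypothetical.

```python
# Illustrative sketch only: names, kernel sizes and values are assumptions,
# not taken from the paper.

def conv1d_same(signal, kernel):
    """Cross-correlate signal with an odd-length kernel, zero-padded so the
    output has the same length as the input."""
    half = len(kernel) // 2
    padded = [0.0] * half + list(signal) + [0.0] * half
    return [
        sum(k * padded[t + i] for i, k in enumerate(kernel))
        for t in range(len(signal))
    ]


def multi_branch(signal, kernels):
    """Training-time form: run every branch separately and sum the outputs."""
    outputs = [conv1d_same(signal, k) for k in kernels]
    return [sum(vals) for vals in zip(*outputs)]


def merge_kernels(kernels):
    """Fold parallel odd-length kernels into one kernel by centre-aligned
    zero-padding followed by element-wise summation."""
    width = max(len(k) for k in kernels)
    merged = [0.0] * width
    for k in kernels:
        offset = (width - len(k)) // 2  # centres the smaller kernel
        for i, w in enumerate(k):
            merged[offset + i] += w
    return merged


signal = [1.0, 2.0, -1.0, 0.5, 3.0, -2.0]
kernels = [
    [0.5, 1.0, 0.5],            # branch with kernel size 3
    [0.1, 0.2, 0.3, 0.2, 0.1],  # branch with kernel size 5
    [1.0],                      # branch with kernel size 1 (per-sample scale)
]

branch_out = multi_branch(signal, kernels)               # three convolutions
fused_out = conv1d_same(signal, merge_kernels(kernels))  # one convolution
max_diff = max(abs(a - b) for a, b in zip(branch_out, fused_out))
print(max_diff)  # expected to be zero up to floating-point rounding
```

The fused path computes the same output with one convolution instead of three, which is the kind of saving a single-path topology offers at inference. A real model would also need to fold in any normalization layers and handle multiple channels, which this sketch omits.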


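Source journal

Computer Speech and Language (Engineering & Technology - Computer Science: Artificial Intelligence)

CiteScore

11.30

Self-citation rate

4.70%

Articles published

Review time

22.9 weeks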

About the journal: Computer Speech & Language publishes reports of original research related to the recognition, understanding, production, coding and mining of speech and language. The speech and language sciences have a long history, but it is only relatively recently that large-scale implementation of and experimentation with complex models of speech and language processing has become feasible. Such research is often carried out somewhat separately by practitioners of artificial intelligence, computer science, electronic engineering, information retrieval, linguistics, phonetics, or psychology.