Universal Pooling Method of Multi-layer Features from Pretrained Models for Speaker Verification

arXiv - EE - Audio and Speech Processing | Pub Date: 2024-09-12 | DOI: arxiv-2409.07770

Jin Sob Kim, Hyun Joon Park, Wooseok Shin, Sung Won Han



Abstract

Recent advancements in automatic speaker verification (ASV) studies have been achieved by leveraging large-scale pretrained networks. In this study, we analyze the approaches toward such a paradigm and, as a result, underline the significance of interlayer information processing. Accordingly, we present a novel approach for exploiting the multilayered nature of pretrained models for ASV, which comprises a layer/frame-level network and two steps of pooling architectures, one for the layer axis and one for the frame axis. Specifically, we let a convolutional architecture directly process a stack of layer outputs. Then, we present a channel attention-based scheme for gauging layer significance and squeeze the layer level with the most representative value. Finally, attentive statistics over frame-level representations yield a single-vector speaker embedding. Comparative experiments are designed using versatile data environments and diverse pretraining models to validate the proposed approach. The experimental results demonstrate the stability of the approach using multi-layer outputs in leveraging pretrained architectures. We then verify the superiority of the proposed ASV backend structure, which involves layer-wise operations, in terms of performance improvement along with cost efficiency compared to the conventional method. The ablation study shows how the proposed interlayer processing aids in maximizing the advantage of utilizing pretrained models.
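The two pooling steps named in the abstract can be illustrated with a minimal sketch; this is not the authors' implementation. It assumes the stacked layer outputs are given as a nested list `stack[layer][frame][dim]`, omits the convolutional layer/frame-level network, replaces the learned channel attention with a parameter-free sigmoid gate, and reads "most representative value" as a maximum over the layer axis. All function names and the attention vector `v` are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(scores):
    # Subtract the maximum for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def layer_pool(stack):
    """Squeeze the layer axis: stack[l][t][d] -> pooled[t][d].

    Assumption: each layer is gated by a sigmoid of its global average
    (a stand-in for learned channel attention), then the layer axis is
    reduced with a max.
    """
    gates = []
    for layer in stack:
        values = [x for frame in layer for x in frame]
        gates.append(sigmoid(sum(values) / len(values)))
    n_frames, n_dims = len(stack[0]), len(stack[0][0])
    return [
        [max(g * layer[t][d] for g, layer in zip(gates, stack))
         for d in range(n_dims)]
        for t in range(n_frames)
    ]


def attentive_stats_pool(frames, v):
    """Squeeze the frame axis: frames[t][d] -> weighted mean + weighted std.

    Assumption: frame scores are a dot product with a hypothetical vector v;
    a trained model would learn this scoring function.
    """
    scores = [sum(vi * xi for vi, xi in zip(v, frame)) for frame in frames]
    alpha = softmax(scores)
    n_dims = len(frames[0])
    mean = [sum(a * f[d] for a, f in zip(alpha, frames)) for d in range(n_dims)]
    second = [sum(a * f[d] ** 2 for a, f in zip(alpha, frames))
              for d in range(n_dims)]
    # Clamp the variance to avoid a negative value from rounding.
    std = [math.sqrt(max(s - m * m, 1e-12)) for s, m in zip(second, mean)]
    return mean + std


def speaker_embedding(stack, v):
    # Layer pooling first, then frame pooling, giving one fixed-size vector.
    return attentive_stats_pool(layer_pool(stack), v)


# Toy input: 2 layers, 3 frames, 2 feature dimensions.
stack = [
    [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
    [[0.6, 0.5], [0.4, 0.3], [0.2, 0.1]],
]
embedding = speaker_embedding(stack, v=[1.0, -1.0])
print(len(embedding))
```

The embedding length is twice the feature dimension, regardless of how many layers or frames go in, which is what lets utterances of different lengths be compared with a single vector.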
