On the Use of Cross-module Attention Statistics Pooling for Speaker Verification

2023 11th International Workshop on Biometrics and Forensics (IWBF), Pub Date: 2023-04-19, DOI: 10.1109/IWBF57495.2023.10157564

J. Alam, A. Fathan


Citations: 0

Abstract

On the Use of Cross-module Attention Statistics Pooling for Speaker Verification

In deep learning-based speaker verification frameworks, extracting the speaker embedding vector plays a key role. In this contribution, we propose a hybrid neural network that employs a cross-module attention pooling mechanism to extract speaker-discriminative utterance-level embeddings. In particular, the proposed system places a 2D Convolutional Neural Network (CNN)-based feature extraction module in cascade with a frame-level network composed of two parallel parts: a network built entirely from Time Delay Neural Network (TDNN) layers and a hybrid TDNN-Long Short-Term Memory (TDNN-LSTM) network. The system also employs cross-module attention statistics pooling, which aggregates speaker information over the utterance-level context by capturing the complementarity between the two parallel modules. We conduct a set of experiments on the VoxCeleb corpus to evaluate the proposed system; the proposed hybrid network provides better results than conventional approaches trained on the same dataset.
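The abstract does not spell out how the cross-module attention is computed, so the following is only a minimal sketch of one plausible reading: each branch is pooled into a weighted mean and standard deviation, with the attention weights for one branch derived from the frame-level outputs of the other. The function names, the dot-product scoring vectors `v_tdnn` and `v_lstm`, and the assumption that both branches emit the same number of frames are all hypothetical and are not taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(frames, score_vector):
    """Score each frame with a dot product, then normalise over time."""
    scores = [sum(x * v for x, v in zip(f, score_vector)) for f in frames]
    return softmax(scores)


def weighted_stats(frames, weights):
    """Attention-weighted mean and standard deviation per feature dimension."""
    dim = len(frames[0])
    mean = [sum(w * f[d] for f, w in zip(frames, weights)) for d in range(dim)]
    var = [
        sum(w * (f[d] - mean[d]) ** 2 for f, w in zip(frames, weights))
        for d in range(dim)
    ]
    # Floor the variance so the square root stays well defined.
    std = [math.sqrt(max(v, 1e-12)) for v in var]
    return mean, std


def cross_module_stats_pooling(tdnn_frames, lstm_frames, v_tdnn, v_lstm):
    """Pool each branch using attention weights computed from the other branch.

    ASSUMPTION: this cross-wiring is an illustrative guess, not the paper's
    confirmed formulation. Both branches must have the same number of frames.
    """
    w_for_tdnn = attention_weights(lstm_frames, v_lstm)
    w_for_lstm = attention_weights(tdnn_frames, v_tdnn)
    mean_t, std_t = weighted_stats(tdnn_frames, w_for_tdnn)
    mean_l, std_l = weighted_stats(lstm_frames, w_for_lstm)
    # Concatenate the statistics into one utterance-level vector.
    return mean_t + std_t + mean_l + std_l


# Toy frame-level outputs: 3 frames, 2 features per branch (made-up numbers).
tdnn = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
lstm = [[0.5, 0.1], [0.2, 0.9], [0.4, 0.4]]
pooled = cross_module_stats_pooling(tdnn, lstm, [0.0, 0.0], [0.0, 0.0])
print(len(pooled))  # 8 values: mean and std for each of the two branches
```

With all-zero scoring vectors the attention weights are uniform, so the sketch reduces to ordinary statistics pooling. In a trained system the scoring parameters would be learned, and the pooled vector would feed the layers that produce the speaker embedding.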
