Masked cross self-attentive encoding based speaker embedding for speaker verification

Impact factor: 0.3, JCR Q4 (Acoustics)

Journal of the Acoustical Society of Korea, Pub Date: 2020-09-01, DOI: 10.7776/ASK.2020.39.5.497

Soonshin Seo, Ji-Hwan Kim

Volume 39, pages 497-504

Citations: 0

Abstract

Constructing speaker embeddings is an important issue in speaker verification. In general, a self-attention mechanism has been applied to speaker embedding encoding. Previous studies focused on training the self-attention in a high-level layer, such as the last pooling layer. In that case, the effect of low-level layers is not well represented in the speaker embedding encoding. In this study, we propose Masked Cross Self-Attentive Encoding (MCSAE) using ResNet. It focuses on training the features of both high-level and low-level layers. Based on multi-layer aggregation, the output features of each residual layer are used as inputs to the MCSAE. In the MCSAE, the interdependence among the input features is trained by a cross self-attention module. A random masking regularization module is also applied to prevent overfitting. The MCSAE enhances the weight of frames that represent speaker information. The output features are then concatenated and encoded into the speaker embedding. As a result, a more informative speaker embedding is encoded by using the MCSAE. The experimental results showed an equal error rate of 2.63% on the VoxCeleb1 evaluation dataset, an improvement over the previous self-attentive encoding and state-of-the-art methods.
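
The abstract reports an equal error rate (EER) without defining it. The EER is the operating point at which the false-acceptance rate and the false-rejection rate of a verification system are equal. The following is a minimal sketch of that standard definition, not the authors' evaluation code: the function name `equal_error_rate`, the simple threshold sweep, and the toy scores are all illustrative assumptions.

```python
from typing import Sequence, Tuple


def equal_error_rate(scores: Sequence[float], labels: Sequence[int]) -> Tuple[float, float]:
    """Return (eer, threshold) for a list of verification trials.

    scores: similarity per trial (higher means more likely the same speaker).
    labels: 1 for a target (same-speaker) trial, 0 for a non-target trial.
    """
    n_target = sum(1 for y in labels if y == 1)
    n_nontarget = len(labels) - n_target
    if n_target == 0 or n_nontarget == 0:
        raise ValueError("need at least one target and one non-target trial")

    best_gap, eer, best_thr = float("inf"), 1.0, float("nan")
    # Sweep every observed score as a decision threshold: accept if score >= thr.
    for thr in sorted(set(scores)):
        false_accepts = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= thr)
        false_rejects = sum(1 for s, y in zip(scores, labels) if y == 1 and s < thr)
        far = false_accepts / n_nontarget  # false-acceptance rate
        frr = false_rejects / n_target     # false-rejection rate
        gap = abs(far - frr)
        # Keep the threshold where the two error rates are closest;
        # with finite trials they may not cross exactly, so average them.
        if gap < best_gap:
            best_gap, eer, best_thr = gap, (far + frr) / 2.0, thr
    return eer, best_thr


# Toy trials (made-up scores, not VoxCeleb1 data).
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
toy_labels = [1, 1, 1, 0, 1, 0, 0, 0]
toy_eer, toy_threshold = equal_error_rate(toy_scores, toy_labels)
```

On real trial lists the two error curves are usually interpolated rather than compared only at observed scores, so evaluation toolkits may report slightly different values than this sweep.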
