Feature fusion method for speaker recognition based on embedding mechanism

JiangKun Zhao, Houpan Zhou, Honglei Liu, Yonghai Du
International Conference on Signal Processing and Communication Security. Published 2022-11-02. DOI: 10.1117/12.2655318 (https://doi.org/10.1117/12.2655318)
Citation count: 0

Abstract

Speaker recognition is a technology that verifies a person's identity from his or her voice. Different feature parameters carry different latent information for speaker recognition. To address the problem that a single feature parameter cannot fully represent a speaker's identity, this paper proposes a feature fusion approach based on an embedding mechanism. The features fused in our approach are filter bank coefficients (Fbank) and Mel-frequency cepstral coefficients (MFCC). A neural network model that takes the embedded features as input extracts the latent and complementary information in the two features. The d-vector output of the neural network model is classified using the Softmax loss function and optimized using the generalized end-to-end (GE2E) loss function. Two of the most common models, the long short-term memory network (LSTM) and the bidirectional long short-term memory network (BiLSTM), serve as our testbed. Results show that the proposed feature fusion approach improves the performance of both models. In particular, the minimum equal error rate (EER) is 4.17% with the BiLSTM model, which is 72.2% and 28.4% lower than with the single MFCC or Fbank feature, respectively.
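The abstract does not spell out how the generalized end-to-end loss is applied to the d-vectors, so the following is only a minimal sketch of the commonly used softmax variant of that loss. It is not the authors' implementation. The function name `ge2e_softmax_loss`, the batch layout, and the fixed scale and offset (`w=10.0`, `b=-5.0`, which are normally learned during training) are assumptions made for illustration. The loss is averaged over utterances here, and d-vectors are assumed to be non-zero.

```python
import math


def _unit(v):
    """Scale a vector to unit length (assumes it is non-zero)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def _mean(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def _cos(a, b):
    """Cosine similarity between two vectors."""
    a, b = _unit(a), _unit(b)
    return sum(x * y for x, y in zip(a, b))


def ge2e_softmax_loss(batch, w=10.0, b=-5.0):
    """Softmax variant of a generalized end-to-end style loss.

    batch[j][i] is the d-vector of utterance i from speaker j.
    Every speaker needs at least two utterances.
    """
    batch = [[_unit(e) for e in spk] for spk in batch]
    centroids = [_mean(spk) for spk in batch]
    total, count = 0.0, 0
    for j, spk in enumerate(batch):
        for i, e in enumerate(spk):
            # For the true speaker, leave the utterance itself out of the centroid.
            own = _mean([u for m, u in enumerate(spk) if m != i])
            sims = [w * _cos(e, own if k == j else c) + b
                    for k, c in enumerate(centroids)]
            # Softmax cross-entropy: -S_jj + log(sum_k exp(S_jk)), computed stably.
            mx = max(sims)
            log_sum = mx + math.log(sum(math.exp(s - mx) for s in sims))
            total += log_sum - sims[j]
            count += 1
    return total / count


# Toy 2-D d-vectors: two speakers, two utterances each.
separated = [[[1.0, 0.0], [0.9, 0.1]], [[0.0, 1.0], [0.1, 0.9]]]
mixed = [[[1.0, 0.0], [0.0, 1.0]], [[0.9, 0.1], [0.1, 0.9]]]
print(ge2e_softmax_loss(separated), ge2e_softmax_loss(mixed))
```

A batch whose utterances cluster by speaker yields a much smaller loss than one whose speakers overlap, which is the behavior the optimization exploits.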
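The equal error rate reported above is the operating point at which the false acceptance rate equals the false rejection rate. The sketch below shows one simple way to estimate it from trial scores. The function names and the toy scores are illustrative assumptions, not data from the paper. The second helper assumes that the quoted 72.2% and 28.4% are relative reductions, which the abstract does not state explicitly, and shows which single-feature baselines that reading would imply.

```python
def equal_error_rate(scores, labels):
    """Estimate the EER from similarity scores.

    labels: 1 for a same-speaker (target) trial, 0 for an impostor trial.
    Returns (eer, threshold); a trial is accepted when score >= threshold.
    """
    n_target = sum(labels)
    n_impostor = len(labels) - n_target
    if n_target == 0 or n_impostor == 0:
        raise ValueError("need at least one target and one impostor trial")
    best_gap, best_eer, best_thr = None, None, None
    # Try every observed score as a candidate threshold.
    for thr in sorted(set(scores)):
        false_accepts = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= thr)
        false_rejects = sum(1 for s, y in zip(scores, labels) if y == 1 and s < thr)
        far = false_accepts / n_impostor
        frr = false_rejects / n_target
        gap = abs(far - frr)
        # Keep the threshold where the two error rates are closest.
        if best_gap is None or gap < best_gap:
            best_gap, best_eer, best_thr = gap, (far + frr) / 2, thr
    return best_eer, best_thr


def baseline_from_relative_reduction(fused_eer, reduction):
    """Baseline EER implied by fused_eer = baseline * (1 - reduction).

    Assumption: `reduction` is a relative reduction expressed as a fraction.
    """
    return fused_eer / (1.0 - reduction)


# Toy trial list (not from the paper).
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 1, 0, 0, 0]
eer, thr = equal_error_rate(scores, labels)
print(eer, thr)

# Baselines implied by the abstract's figures under the relative-reduction reading.
print(baseline_from_relative_reduction(4.17, 0.722))  # single MFCC feature
print(baseline_from_relative_reduction(4.17, 0.284))  # single Fbank feature
```

Under that reading, the single-feature baselines would be roughly 15.0% for MFCC and 5.82% for Fbank. These are derived values, not figures reported in the abstract.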