Feature fusion method for speaker recognition based on embedding mechanism

JiangKun Zhao, Houpan Zhou, Honglei Liu, Yonghai Du

International Conference on Signal Processing and Communication Security, published 2022-11-02. DOI: 10.1117/12.2655318
Speaker recognition is a technology that verifies a person's identity by his or her voice. Different feature parameters carry different latent information for speaker recognition. To address the problem that a single feature parameter cannot fully represent a speaker's identity, this paper proposes a feature fusion approach based on an embedding mechanism. The fused features are filter-bank coefficients (Fbank) and Mel-frequency cepstral coefficients (MFCC). A neural network model that takes the embedded features as input extracts the latent, complementary information in the two features. The d-vector output of the network is classified with the Softmax loss function and optimized with the generalized end-to-end (GE2E) loss function. Two common models, the long short-term memory network (LSTM) and the bi-directional long short-term memory network (BiLSTM), serve as the testbed. Results show that the proposed feature fusion approach improves the performance of both models. In particular, the BiLSTM model reaches a minimum equal error rate (EER) of 4.17%, a relative reduction of 72.2% and 28.4% compared with the single MFCC and Fbank features, respectively.
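The abstract does not spell out the fusion step, but a minimal sketch of the two feature streams it names is straightforward. The Python snippet below (using librosa) shows how Fbank and MFCC features could be extracted from one utterance and stacked frame-wise for an LSTM/BiLSTM front end; the frame parameters, 40 Mel bands, 13 MFCCs, and plain concatenation are all assumptions for illustration, not the paper's embedding mechanism.

```python
import numpy as np
import librosa

def fused_features(path, sr=16000, n_mels=40, n_mfcc=13):
    """Extract Fbank and MFCC features and fuse them frame-wise.

    Concatenation is a stand-in for the paper's embedding-based fusion,
    whose exact form the abstract does not specify.
    """
    y, sr = librosa.load(path, sr=sr)
    # 25 ms windows with a 10 ms hop at 16 kHz.
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=400,
                                         hop_length=160, n_mels=n_mels)
    fbank = librosa.power_to_db(mel)                     # log-Mel filterbank (Fbank)
    mfcc = librosa.feature.mfcc(S=fbank, n_mfcc=n_mfcc)  # MFCC from the same log-Mel
    fused = np.concatenate([fbank, mfcc], axis=0)        # (n_mels + n_mfcc, T)
    return fused.T                                       # one row per frame for the RNN
```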
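The generalized end-to-end (GE2E) loss the abstract refers to was introduced by Wan et al. (2018) for d-vector training. Below is a minimal PyTorch sketch of its softmax variant: each utterance embedding is scored against every speaker centroid by scaled cosine similarity, with a leave-one-out centroid for the utterance's own speaker. The batch layout (N speakers × M utterances) follows that paper, while the fixed constants w and b stand in for what are trainable scalars in the original formulation.

```python
import torch
import torch.nn.functional as F

def ge2e_softmax_loss(embeddings, w=10.0, b=-5.0):
    """GE2E softmax loss over a batch of d-vectors.

    embeddings: (N speakers, M utterances per speaker, D); requires M >= 2.
    """
    N, M, D = embeddings.shape
    e = F.normalize(embeddings, dim=-1)

    # Speaker centroids, plus leave-one-out centroids so an utterance is
    # never compared against a centroid that already contains it.
    centroids = F.normalize(e.mean(dim=1), dim=-1)                         # (N, D)
    loo = F.normalize((e.sum(dim=1, keepdim=True) - e) / (M - 1), dim=-1)  # (N, M, D)

    # Cosine similarity of every utterance to every speaker centroid.
    sim = torch.einsum('nmd,kd->nmk', e, centroids)  # (N, M, N)
    idx = torch.arange(N)
    sim[idx, :, idx] = (e * loo).sum(dim=-1)         # true-speaker column, leave-one-out
    sim = w * sim + b                                # w, b: trainable scalars in the paper

    # Softmax over centroids; the target is the utterance's own speaker.
    labels = idx.repeat_interleave(M)                # (N * M,)
    return F.cross_entropy(sim.reshape(N * M, N), labels)
```

In training, a hypothetical call would be ge2e_softmax_loss(model(batch)) with model(batch) returning, say, a (4, 5, 256) tensor of d-vectors for 4 speakers with 5 utterances each.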