Feature fusion method for speaker recognition based on embedding mechanism

JiangKun Zhao, Houpan Zhou, Honglei Liu, Yonghai Du

International Conference on Signal Processing and Communication Security, published 2022-11-02. DOI: 10.1117/12.2655318
Speaker recognition is a technology that verifies a person's identity by his or her voice. Different feature parameters carry different latent information for speaker recognition. To address the problem that a single feature parameter cannot fully represent a speaker's identity, this paper proposes a feature fusion approach based on an embedding mechanism. The fused features are filter-bank coefficients (Fbank) and Mel-frequency cepstral coefficients (MFCC). A neural network model that takes the embedded features as input extracts the latent, complementary information in the two features. The d-vector output of the network is classified with the Softmax loss function and optimized with the generalized end-to-end (GE2E) loss function. Two common models, the long short-term memory network (LSTM) and the bi-directional long short-term memory network (BiLSTM), serve as the testbed. Results show that the proposed feature fusion approach improves the performance of both models. In particular, the BiLSTM model reaches a minimum equal error rate (EER) of 4.17%, a relative reduction of 72.2% and 28.4% compared with the single MFCC and Fbank features, respectively.
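The abstract does not spell out the fusion step, but a minimal sketch of the two feature streams it names is straightforward. The Python snippet below (using librosa) shows how Fbank and MFCC features could be extracted from one utterance and stacked frame-wise for an LSTM/BiLSTM front end; the frame parameters, 40 Mel bands, 13 MFCCs, and plain concatenation are all assumptions for illustration, not the paper's embedding mechanism.

```python
import numpy as np
import librosa

def fused_features(path, sr=16000, n_mels=40, n_mfcc=13):
    """Extract Fbank and MFCC features and fuse them frame-wise.

    Concatenation is a stand-in for the paper's embedding-based fusion,
    whose exact form the abstract does not specify.
    """
    y, sr = librosa.load(path, sr=sr)
    # 25 ms windows with a 10 ms hop at 16 kHz.
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=400,
                                         hop_length=160, n_mels=n_mels)
    fbank = librosa.power_to_db(mel)                     # log-Mel filterbank (Fbank)
    mfcc = librosa.feature.mfcc(S=fbank, n_mfcc=n_mfcc)  # MFCC from the same log-Mel
    fused = np.concatenate([fbank, mfcc], axis=0)        # (n_mels + n_mfcc, T)
    return fused.T                                       # one row per frame for the RNN
```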
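The generalized end-to-end (GE2E) loss the abstract refers to was introduced by Wan et al. (2018) for d-vector training. Below is a minimal PyTorch sketch of its softmax variant: each utterance embedding is scored against every speaker centroid by scaled cosine similarity, with a leave-one-out centroid for the utterance's own speaker. The batch layout (N speakers × M utterances) follows that paper, while the fixed constants w and b stand in for what are trainable scalars in the original formulation.

```python
import torch
import torch.nn.functional as F

def ge2e_softmax_loss(embeddings, w=10.0, b=-5.0):
    """GE2E softmax loss over a batch of d-vectors.

    embeddings: (N speakers, M utterances per speaker, D); requires M >= 2.
    """
    N, M, D = embeddings.shape
    e = F.normalize(embeddings, dim=-1)

    # Speaker centroids, plus leave-one-out centroids so an utterance is
    # never compared against a centroid that already contains it.
    centroids = F.normalize(e.mean(dim=1), dim=-1)                         # (N, D)
    loo = F.normalize((e.sum(dim=1, keepdim=True) - e) / (M - 1), dim=-1)  # (N, M, D)

    # Cosine similarity of every utterance to every speaker centroid.
    sim = torch.einsum('nmd,kd->nmk', e, centroids)  # (N, M, N)
    idx = torch.arange(N)
    sim[idx, :, idx] = (e * loo).sum(dim=-1)         # true-speaker column, leave-one-out
    sim = w * sim + b                                # w, b: trainable scalars in the paper

    # Softmax over centroids; the target is the utterance's own speaker.
    labels = idx.repeat_interleave(M)                # (N * M,)
    return F.cross_entropy(sim.reshape(N * M, N), labels)
```

In training, a hypothetical call would be ge2e_softmax_loss(model(batch)) with model(batch) returning, say, a (4, 5, 256) tensor of d-vectors for 4 speakers with 5 utterances each.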