{"title":"Multi-Task Learning Based End-to-End Speaker Recognition","authors":"Yuxuan Pan, Weiqiang Zhang","doi":"10.1145/3372806.3372818","DOIUrl":null,"url":null,"abstract":"Recently, there has been an increasing interest in end-to-end speaker recognition that directly take raw speech waveform as input without any hand-crafted features such as FBANK and MFCC. SincNet is a recently developed novel convolutional neural network (CNN) architecture in which the filters in the first convolutional layer are set to band-pass filters (sinc functions). Experiments show that SincNet achieves a significant decrease in frame error rate (FER) than traditional CNNs and DNNs.\n In this paper we demonstrate how to improve the performance of SincNet using Multi-Task learning (MTL). In the proposed Sinc- Net architecture, besides the main task (speaker recognition), a phoneme recognition task is employed as an auxiliary task. The network uses sinc layers and convolutional layers as shared layers to improve the extensiveness of the network, and the outputs of shared layers are fed into two different sets of full-connected layers for classification. Our experiments, conducted on TIMIT corpora, show that the proposed architecture SincNet-MTL performs better than standard SincNet architecture in both classification error rates (CER) and convergence rate.","PeriodicalId":340004,"journal":{"name":"International Conference on Signal Processing and Machine Learning","volume":"90 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-11-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"International Conference on Signal Processing and Machine Learning","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3372806.3372818","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Recently, there has been increasing interest in end-to-end speaker recognition systems that take the raw speech waveform directly as input, without hand-crafted features such as FBANK or MFCC. SincNet is a recently proposed convolutional neural network (CNN) architecture in which the filters of the first convolutional layer are constrained to be band-pass filters (sinc functions). Experiments show that SincNet achieves a significantly lower frame error rate (FER) than conventional CNNs and DNNs.
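To make the sinc-filter idea concrete, the following is a minimal PyTorch sketch of a band-pass sinc convolution layer: only the low and high cutoff frequencies of each filter are learned, while the filter shape itself is a fixed sinc band-pass response. The class name, hyperparameters (80 filters, kernel length 251, 16 kHz sampling rate), and initialisation are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class SincConv1d(nn.Module):
    """Sketch of a sinc-based band-pass convolution layer (SincNet-style).

    Only the cutoff frequencies are trainable; the band-pass impulse
    response is the difference of two low-pass sinc filters.
    """

    def __init__(self, out_channels=80, kernel_size=251, sample_rate=16000):
        super().__init__()
        # Learnable cutoffs, initialised to roughly cover the spectrum.
        low = torch.linspace(30.0, sample_rate / 2 - 200.0, out_channels)
        band = torch.full((out_channels,), 100.0)
        self.low_hz = nn.Parameter(low)
        self.band_hz = nn.Parameter(band)
        # Symmetric time axis (in seconds) for evaluating the sinc functions.
        n = torch.arange(-(kernel_size // 2), kernel_size // 2 + 1, dtype=torch.float32)
        self.register_buffer("n", n / sample_rate)
        self.register_buffer("window", torch.hamming_window(kernel_size))

    def forward(self, x):  # x: (batch, 1, time)
        f1 = torch.abs(self.low_hz)              # low cutoff per filter
        f2 = f1 + torch.abs(self.band_hz)        # high cutoff per filter
        t = self.n.unsqueeze(0)                  # (1, kernel_size)
        # Ideal band-pass = difference of two low-pass sinc filters.
        low_pass1 = 2 * f1.unsqueeze(1) * torch.sinc(2 * f1.unsqueeze(1) * t)
        low_pass2 = 2 * f2.unsqueeze(1) * torch.sinc(2 * f2.unsqueeze(1) * t)
        filters = (low_pass2 - low_pass1) * self.window
        filters = filters.unsqueeze(1)           # (out_channels, 1, kernel_size)
        return nn.functional.conv1d(x, filters)
```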
In this paper we demonstrate how to improve the performance of SincNet using multi-task learning (MTL). In the proposed SincNet architecture, a phoneme recognition task is employed as an auxiliary task alongside the main task (speaker recognition). The network uses the sinc layer and convolutional layers as shared layers to improve the generality of the learned representations, and the outputs of the shared layers are fed into two separate sets of fully connected layers for classification. Our experiments, conducted on the TIMIT corpus, show that the proposed SincNet-MTL architecture outperforms the standard SincNet architecture in both classification error rate (CER) and convergence rate.
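The multi-task set-up described above can be sketched as a shared trunk feeding two classification heads, trained with a weighted sum of the two cross-entropy losses. The sketch below reuses the hypothetical SincConv1d layer from the previous snippet; the trunk structure, layer sizes, and the auxiliary loss weight are illustrative assumptions rather than values from the paper.

```python
import torch.nn as nn
import torch.nn.functional as F


class SincNetMTL(nn.Module):
    """Sketch of a SincNet trunk shared by a speaker head and a phoneme head."""

    def __init__(self, n_speakers, n_phonemes, hidden=512):
        super().__init__()
        # Shared layers: sinc layer plus one ordinary convolution (simplified).
        self.shared = nn.Sequential(
            SincConv1d(out_channels=80, kernel_size=251),
            nn.LeakyReLU(),
            nn.Conv1d(80, 60, kernel_size=5),
            nn.LeakyReLU(),
            nn.AdaptiveAvgPool1d(1),   # collapse time to one 60-dim vector per chunk
            nn.Flatten(),
        )
        # Task-specific fully connected heads.
        self.speaker_head = nn.Sequential(
            nn.Linear(60, hidden), nn.LeakyReLU(), nn.Linear(hidden, n_speakers)
        )
        self.phoneme_head = nn.Sequential(
            nn.Linear(60, hidden), nn.LeakyReLU(), nn.Linear(hidden, n_phonemes)
        )

    def forward(self, wav):            # wav: (batch, 1, time)
        shared = self.shared(wav)
        return self.speaker_head(shared), self.phoneme_head(shared)


def mtl_loss(spk_logits, phn_logits, spk_labels, phn_labels, aux_weight=0.3):
    """Main speaker loss plus a down-weighted auxiliary phoneme loss."""
    return F.cross_entropy(spk_logits, spk_labels) + \
        aux_weight * F.cross_entropy(phn_logits, phn_labels)
```

At inference time only the speaker head is needed; the phoneme head serves purely as a training-time regulariser for the shared layers.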