{"title":"Extracting sub-glottal and Supra-glottal features from MFCC using convolutional neural networks for speaker identification in degraded audio signals","authors":"Anurag Chowdhury, A. Ross","doi":"10.1109/BTAS.2017.8272748","DOIUrl":null,"url":null,"abstract":"We present a deep learning based algorithm for speaker recognition from degraded audio signals. We use the commonly employed Mel-Frequency Cepstral Coefficients (MFCC) for representing the audio signals. A convolutional neural network (CNN) based on 1D filters, rather than 2D filters, is then designed. The filters in the CNN are designed to learn inter-dependency between cepstral coefficients extracted from audio frames of fixed temporal expanse. Our approach aims at extracting speaker dependent features, like Sub-glottal and Supra-glottal features, of the human speech production apparatus for identifying speakers from degraded audio signals. The performance of the proposed method is compared against existing baseline schemes on both synthetically and naturally corrupted speech data. Experiments convey the efficacy of the proposed architecture for speaker recognition.","PeriodicalId":372008,"journal":{"name":"2017 IEEE International Joint Conference on Biometrics (IJCB)","volume":"34 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"10","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2017 IEEE International Joint Conference on Biometrics (IJCB)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/BTAS.2017.8272748","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 10
Abstract
We present a deep learning-based algorithm for speaker recognition from degraded audio signals. We use the commonly employed Mel-Frequency Cepstral Coefficients (MFCC) to represent the audio signals. A convolutional neural network (CNN) based on 1D filters, rather than 2D filters, is then designed. The filters in the CNN are designed to learn the inter-dependency between cepstral coefficients extracted from audio frames of fixed temporal expanse. Our approach aims to extract speaker-dependent features of the human speech production apparatus, such as sub-glottal and supra-glottal features, for identifying speakers from degraded audio signals. The performance of the proposed method is compared against existing baseline schemes on both synthetically and naturally corrupted speech data. Experiments demonstrate the efficacy of the proposed architecture for speaker recognition.
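To make the idea concrete, below is a minimal illustrative sketch of the kind of pipeline the abstract describes: MFCCs extracted per frame, then a 1D-CNN whose filters span all cepstral coefficients over a short run of frames, so each filter captures inter-dependencies between coefficients within a fixed temporal expanse. This is not the authors' architecture; the libraries (librosa, PyTorch) and all hyperparameters (number of MFCCs, window length, channel widths, kernel sizes, speaker count) are assumptions chosen for illustration.

```python
# Illustrative sketch only: a 1D-CNN speaker classifier over MFCC windows.
# All hyperparameters below are placeholders, not values from the paper.
import librosa
import torch
import torch.nn as nn

N_MFCC = 40       # cepstral coefficients per frame (assumed)
N_FRAMES = 200    # fixed temporal expanse of one input window (assumed)
N_SPEAKERS = 50   # size of the closed speaker set (assumed)

def mfcc_window(path: str, sr: int = 16000) -> torch.Tensor:
    """Load audio and return one fixed-size MFCC window of shape (1, N_MFCC, N_FRAMES)."""
    y, sr = librosa.load(path, sr=sr)
    m = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=N_MFCC)  # (N_MFCC, T)
    m = m[:, :N_FRAMES]  # crop to the fixed expanse; real code would pad short clips
    return torch.from_numpy(m).float().unsqueeze(0)

class Speaker1DCNN(nn.Module):
    """1D convolution along time: each filter spans all N_MFCC input channels
    over kernel_size frames, modeling inter-dependency between coefficients."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(N_MFCC, 128, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(128, 256, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),  # collapse time into one embedding per window
        )
        self.classifier = nn.Linear(256, N_SPEAKERS)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = self.features(x).squeeze(-1)  # (batch, 256) speaker embedding
        return self.classifier(z)         # speaker logits

if __name__ == "__main__":
    x = torch.randn(4, N_MFCC, N_FRAMES)  # a batch of random MFCC windows
    print(Speaker1DCNN()(x).shape)        # torch.Size([4, N_SPEAKERS])
```

Note the design choice the abstract points at: treating the MFCC matrix as N_MFCC channels and convolving only along the time axis (1D filters) means every filter mixes all cepstral coefficients jointly within its temporal receptive field, rather than sliding a small 2D patch over the coefficient-time plane as image-style CNNs do.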