Ningxin Liang, W. Xu, Chengfang Luo, Wenxiong Kang
Proceedings of the 2020 6th International Conference on Computing and Artificial Intelligence. Published 2020-04-23. DOI: 10.1145/3404555.3404571 (https://doi.org/10.1145/3404555.3404571)
Learning the Front-End Speech Feature with Raw Waveform for End-to-End Speaker Recognition
State-of-the-art speaker recognition systems based on deep neural networks tend to follow a "divide and conquer" paradigm: speech features are extracted first, and a speaker classifier is then trained on them. These methods usually rely on fixed, handcrafted features such as Mel-frequency cepstral coefficients (MFCCs) to preprocess the waveform before the classification pipeline. In this paper, inspired by promising work on modeling systems directly from the raw speech signal for applications such as speech recognition, anti-spoofing, and emotion recognition, we present an end-to-end speaker recognition system that combines a front-end raw waveform feature extractor, a back-end speaker embedding classifier, and an angle-based loss optimizer. Specifically, the proposed front-end raw waveform feature extractor serves as a trainable alternative to MFCCs and requires no modification of the acoustic model. We also detail the advantages of the raw waveform feature extractor: it uses a time convolution layer to reduce temporal variations, and it adaptively learns a front-end speech feature representation through supervised training jointly with the rest of the classification model. Our experiments, conducted on the CSTR VCTK Corpus, demonstrate that the proposed end-to-end speaker recognition system can achieve state-of-the-art performance compared with the baseline models.
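The time convolution front-end mentioned in the abstract can be illustrated with a minimal sketch. The abstract gives no layer sizes or nonlinearities, so everything concrete here is an assumption: the function name `time_conv_frontend`, the filter count, kernel length, stride, pooling size, and the choice of absolute-value rectification followed by average pooling and log compression (a pattern common in raw-waveform front-ends, not confirmed as the authors' design). The filters are random here; in the system described they would be learned jointly with the classifier by backpropagation, typically in a deep learning framework rather than plain Python.

```python
import math
import random


def time_conv_frontend(waveform, filters, stride, pool_size):
    """Map a raw waveform to a list of frames, one value per filter.

    Assumed pipeline (hypothetical, for illustration only):
    strided 1-D convolution -> rectification -> average pooling -> log.
    """
    kernel = len(filters[0])
    # Number of positions at which a full kernel-length window fits.
    n_steps = (len(waveform) - kernel) // stride + 1

    # Strided time convolution followed by absolute-value rectification.
    responses = []
    for start in range(0, n_steps * stride, stride):
        window = waveform[start:start + kernel]
        responses.append(
            [abs(sum(w * x for w, x in zip(f, window))) for f in filters]
        )

    # Average pooling over time reduces temporal variation; the log
    # compresses dynamic range, loosely mirroring log-Mel features.
    frames = []
    for i in range(0, len(responses) - pool_size + 1, pool_size):
        block = responses[i:i + pool_size]
        frames.append(
            [math.log(1e-6 + sum(col) / pool_size) for col in zip(*block)]
        )
    return frames


# Toy input: 400 samples of a 220 Hz tone at an assumed 16 kHz sample rate.
random.seed(0)
waveform = [math.sin(2 * math.pi * 220 * t / 16000) for t in range(400)]
# Eight random filters of length 40 stand in for learned filters.
filters = [[random.gauss(0.0, 0.1) for _ in range(40)] for _ in range(8)]

features = time_conv_frontend(waveform, filters, stride=10, pool_size=4)
print(len(features), len(features[0]))  # 9 frames, 8 features per frame
```

The output has the same shape as a conventional frame-by-coefficient feature matrix, which is why such a front-end can replace MFCCs without changing the model that consumes the features.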