Raw Waveform Based Speaker Identification Using Deep Neural Networks

Banala Saritha, Mohammad Azharuddin Laskar, R. Laskar, Madhuchhanda Choudhury

2022 IEEE Silchar Subsection Conference (SILCON), 4 November 2022. DOI: 10.1109/SILCON55242.2022.10028890
Deep learning has gained considerable prominence as a replacement for i-vectors in the speaker identification task, and deep neural networks (DNNs) have received much attention in end-to-end (E2E) speaker identification. Earlier, DNNs were trained on handcrafted speech features such as Mel-filter-bank energies and Mel-frequency cepstral coefficients (MFCCs). Later, because the raw speech signal is lossless, processing raw waveforms became an active research area in E2E speaker identification, automatic music tagging, and speech recognition. Convolutional neural networks (CNNs) have recently shown promising results when fed directly with raw speech samples: the CNN analyzes the waveform to learn low-level speech representations rather than relying on conventional handcrafted features, which may enable the system to handle speaker properties such as pitch and formants more effectively. An efficient neural network design is vital to achieving this. The CNN architecture proposed in this paper encourages the deep convolutional layers to develop more efficient filters for end-to-end speaker identification. The proposed architecture converges quickly and outperforms a conventional CNN on raw waveforms. Evaluated on the LibriSpeech dataset, it improves speaker identification accuracy by 10% and reduces the validation loss by 32%.
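To make the raw-waveform approach concrete, the sketch below shows a minimal CNN speaker classifier in PyTorch that consumes raw samples directly. The abstract does not disclose the paper's exact layer configuration, so every kernel size, channel count, stride, and the 251-way output here are illustrative assumptions (251 matches the speaker count of LibriSpeech train-clean-100), not the authors' architecture.

```python
# Minimal sketch of a raw-waveform CNN speaker classifier (PyTorch).
# All hyperparameters below are assumptions for illustration; the paper
# does not publish its exact architecture.
import torch
import torch.nn as nn

class RawWaveCNN(nn.Module):
    def __init__(self, n_speakers: int = 251):  # 251 = LibriSpeech train-clean-100 speakers
        super().__init__()
        self.features = nn.Sequential(
            # A wide first convolution acts as a learned filter bank over raw samples,
            # replacing a handcrafted Mel-filter-bank front end.
            nn.Conv1d(1, 64, kernel_size=251, stride=5), nn.BatchNorm1d(64), nn.ReLU(),
            nn.MaxPool1d(3),
            nn.Conv1d(64, 128, kernel_size=5), nn.BatchNorm1d(128), nn.ReLU(),
            nn.MaxPool1d(3),
            nn.Conv1d(128, 128, kernel_size=3), nn.BatchNorm1d(128), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),  # collapse the time axis to a fixed-length embedding
        )
        self.classifier = nn.Linear(128, n_speakers)

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, 1, samples), e.g. 3 s at 16 kHz -> 48000 samples
        x = self.features(waveform).squeeze(-1)  # (batch, 128)
        return self.classifier(x)                # per-speaker logits

model = RawWaveCNN()
logits = model(torch.randn(4, 1, 48000))  # 4 random 3-second "utterances"
print(logits.shape)                        # torch.Size([4, 251])
```

Trained with standard cross-entropy over speaker labels, a model of this shape is the usual baseline for E2E raw-waveform speaker identification; the paper's contribution lies in a deeper convolutional design that converges faster and learns more effective filters than such a conventional CNN.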