Deep Learning Based Speaker Recognition System with CNN and LSTM Techniques

Noshin Nirvana Prachi, Faisal Mahmud Nahiyan, Md. Habibullah, R. Khan
2022 Interdisciplinary Research in Technology and Management (IRTM), published 2022-02-24
DOI: 10.1109/irtm54583.2022.9791766
Citations: 3

Abstract

Speaker recognition is an advanced method for identifying a person from the biometric characteristics of voice samples. It has become a popular and useful research subject, with essential applications in security, assistance, replication, authentication, automation, and verification. Many techniques for speaker verification and identification are implemented using deep learning and neural network concepts on various datasets. The primary goal of this work is to develop robust speaker recognition techniques that identify speakers from audio with accuracy approaching human levels of comprehension. The TIMIT and LibriSpeech datasets are used in this paper to develop an efficient automatic speaker recognition system. This work focuses on using MFCC to transform audio into spectrogram-like features without losing the essential characteristics of the audio file in question. We apply both a closed-set and an open-set evaluation procedure on these datasets. The closed-set procedure follows the standard machine learning convention of drawing training and test data from the same dataset, leading to higher accuracy. The open-set procedure, in contrast, trains on one dataset and tests on the other on each occasion, and the accuracy in this case turned out to be relatively lower. On each dataset, CNN and LSTM deep learning techniques were used to identify the speaker, leading to the observation that the CNN yielded the higher accuracy.
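The MFCC transform the abstract relies on can be made concrete with a minimal NumPy sketch: frame the waveform, window each frame, take the power spectrum, pool it through a triangular mel filterbank, and apply a log followed by a DCT to obtain cepstral coefficients. The parameter values below (16 kHz sample rate, 512-point FFT, 26 mel bands, 13 coefficients) are common illustrative defaults, not values taken from the paper.

```python
import numpy as np
from scipy.fftpack import dct

def mfcc(signal, sr=16000, n_fft=512, hop=160, n_mels=26, n_mfcc=13):
    """MFCCs: frame -> window -> power spectrum -> mel filterbank -> log -> DCT."""
    win = np.hanning(n_fft)
    # Slice the waveform into overlapping, Hann-windowed frames
    starts = range(0, len(signal) - n_fft + 1, hop)
    frames = np.array([signal[s:s + n_fft] * win for s in starts])
    # Power spectrum of each frame
    spec = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2 / n_fft
    # Triangular mel filterbank spanning 0 .. sr/2
    hz2mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel2hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz2mel(0), hz2mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel2hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        lo, c, hi = bins[i - 1], bins[i], bins[i + 1]
        fbank[i - 1, lo:c] = (np.arange(lo, c) - lo) / max(c - lo, 1)
        fbank[i - 1, c:hi] = (hi - np.arange(c, hi)) / max(hi - c, 1)
    # Log mel energies, then DCT to decorrelate -> cepstral coefficients
    log_mel = np.log(spec @ fbank.T + 1e-10)
    return dct(log_mel, type=2, axis=1, norm='ortho')[:, :n_mfcc]
```

In practice a library such as librosa (`librosa.feature.mfcc`) would typically compute these features; the sketch only makes explicit the transformation the CNN and LSTM models consume as input.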