Deep Learning for Robust Speech Command Recognition Using Convolutional Neural Networks (CNN)

Proceedings of the 2022 International Conference on Computer, Control, Informatics and Its Applications. Pub Date: 2022-11-22. DOI: 10.1145/3575882.3575902

Zahra Cantiabela, H. Pardede, Vicky Zilvan, W. Sulandari, R. S. Yuwana, A. A. Supianto, Dikdik Krisnandi



Abstract

The rapid development of mobile devices has made voice-based human-computer interaction increasingly popular and effective. This has been made possible by the rapid growth of Automatic Speech Recognition (ASR) technologies, which convert human speech signals into text and interpret that text as commands for a computer system to perform. One remaining challenge for ASR is that system performance can degrade significantly in noisy environments. This research aims to build a speech recognition system that recognizes speech commands in both clean and noisy conditions. Deep Neural Network (DNN) methods are the dominant approach to ASR, but there is increasing interest in using convolutional neural networks (CNNs) instead. In this study, we develop CNN-based architectures for robust ASR of speech commands. We explore CNN architectures of various depths to improve the robustness of the speech recognition system. We also optimize the best model using early stopping and two optimizers, Adam and SGD (stochastic gradient descent). Our experiments show that the CNN achieved an accuracy of 90.64%, while the DNN model achieved 86.74%, in clean conditions. In noisy conditions, increasing the number of CNN layers improves the robustness of ASR. The CNN method achieves 77.38% accuracy in clean conditions and 87.59% in noisy conditions.
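The abstract names early stopping and the Adam and SGD optimizers but gives no hyperparameters or implementation details. The following is a minimal, framework-free sketch of those three ideas and is not the authors' code. The toy loss `(w - 3)^2` standing in for a validation loss, the helper names (`sgd_step`, `Adam`, `EarlyStopping`, `train`), and all numeric settings (learning rate, `patience`, `min_delta`) are assumptions chosen for illustration only.

```python
import math


def sgd_step(w, grad, lr=0.1):
    """Plain SGD update: step against the gradient."""
    return w - lr * grad


class Adam:
    """Scalar Adam update with bias-corrected moment estimates."""

    def __init__(self, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.m = 0.0  # running mean of gradients
        self.v = 0.0  # running mean of squared gradients
        self.t = 0    # number of updates so far

    def step(self, w, grad):
        self.t += 1
        self.m = self.beta1 * self.m + (1 - self.beta1) * grad
        self.v = self.beta2 * self.v + (1 - self.beta2) * grad * grad
        m_hat = self.m / (1 - self.beta1 ** self.t)
        v_hat = self.v / (1 - self.beta2 ** self.t)
        return w - self.lr * m_hat / (math.sqrt(v_hat) + self.eps)


class EarlyStopping:
    """Signal a stop after `patience` epochs without improvement."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True to stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


def train(step_fn, w=0.0, max_epochs=500, patience=3, min_delta=1e-8):
    """Minimise the toy loss (w - 3)^2 (an assumed stand-in for validation loss)."""
    stopper = EarlyStopping(patience, min_delta)
    best_w = w
    for _ in range(max_epochs):
        grad = 2.0 * (w - 3.0)      # gradient of the toy loss
        w = step_fn(w, grad)
        val_loss = (w - 3.0) ** 2
        stop = stopper.step(val_loss)
        if stopper.bad_epochs == 0:  # this epoch improved: keep its weight
            best_w = w
        if stop:
            break
    return best_w, stopper.best


sgd_w, sgd_loss = train(lambda w, g: sgd_step(w, g, lr=0.1))
adam_w, adam_loss = train(Adam(lr=0.1).step)
print(f"SGD : best w={sgd_w:.4f}, best loss={sgd_loss:.2e}")
print(f"Adam: best w={adam_w:.4f}, best loss={adam_loss:.2e}")
```

In a real speech command model the scalar `w` would be the CNN's weights and `val_loss` would come from a held-out validation set. Keeping the best weights rather than the last ones is a common companion to early stopping, assumed here rather than taken from the paper.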


