Deep Learning for Robust Speech Command Recognition Using Convolutional Neural Networks (CNN)

Zahra Cantiabela, H. Pardede, Vicky Zilvan, W. Sulandari, R. S. Yuwana, A. A. Supianto, Dikdik Krisnandi

Proceedings of the 2022 International Conference on Computer, Control, Informatics and Its Applications, published 2022-11-22
DOI: 10.1145/3575882.3575902
Citations: 0
Abstract
The rapid development of mobile devices has made human-computer interaction through voice increasingly popular and effective. This is made possible by the rapid growth of Automatic Speech Recognition (ASR) technology. ASR converts human speech signals into text and interprets them as commands for computer systems to perform. One remaining challenge for ASR is that system performance can degrade significantly in noisy environments. This research aims to build a speech recognition system capable of recognizing human speech in both clean and noisy conditions, so that it can recognize speech commands even under noise. Deep Neural Network (DNN) methods are the dominant methods for ASR, but there is increasing interest in using convolutional neural networks (CNNs) instead. In this study, we develop CNN-based architectures for robust speech command recognition. We explore various depths of the CNN architecture to improve the robustness of the speech recognition system. We also optimized the best model using early stopping and two optimizers, Adam and SGD (stochastic gradient descent). Our experiments show that in clean conditions the CNN exhibited an accuracy of 90.64%, while the DNN model exhibited 86.74%. In noisy conditions, increasing the number of CNN layers improves the ASR system's robustness. The CNN method achieves 77.38% accuracy in clean conditions and 87.59% in noisy conditions.
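The approach the abstract describes, a CNN over speech input whose layer depth is varied and which is trained with Adam or SGD, can be sketched as follows. This is a minimal illustration, not the authors' actual architecture: the input shape (log-mel spectrograms of 40 bands by 101 frames), channel counts, kernel sizes, and the 12-class output are all assumptions chosen for concreteness.

```python
# Hedged sketch of a depth-configurable CNN speech-command classifier.
# All layer sizes and the input format are illustrative assumptions,
# not taken from the paper.
import torch
import torch.nn as nn

class SpeechCommandCNN(nn.Module):
    """A stack of conv blocks; `depth` controls how many blocks are used,
    mirroring the paper's exploration of CNN layer depth."""
    def __init__(self, n_classes: int = 12, depth: int = 3):
        super().__init__()
        blocks, in_ch = [], 1
        for i in range(depth):
            out_ch = 16 * (2 ** i)              # 16, 32, 64, ... channels
            blocks += [
                nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.MaxPool2d(2),                # halve time/frequency dims
            ]
            in_ch = out_ch
        self.features = nn.Sequential(*blocks)
        self.pool = nn.AdaptiveAvgPool2d(1)     # collapse to (B, C, 1, 1)
        self.classifier = nn.Linear(in_ch, n_classes)

    def forward(self, x):                       # x: (B, 1, n_mels, frames)
        h = self.pool(self.features(x)).flatten(1)
        return self.classifier(h)

# One training step with either of the two optimizers the paper compares.
model = SpeechCommandCNN(n_classes=12, depth=3)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
# alternative: opt = torch.optim.SGD(model.parameters(), lr=1e-2)
x = torch.randn(4, 1, 40, 101)                  # batch of 4 fake spectrograms
logits = model(x)
loss = nn.functional.cross_entropy(logits, torch.randint(0, 12, (4,)))
loss.backward()
opt.step()
```

Early stopping, which the abstract also mentions, would wrap the training loop: track validation loss each epoch and halt when it has not improved for a fixed number of epochs.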