Spoken Word and Speaker Recognition Using MFCC and Multiple Recurrent Neural Networks

Yoga F. Utomo, E. C. Djamal, Fikri Nugraha, F. Renaldi

2020 7th International Conference on Electrical Engineering, Computer Sciences and Informatics (EECSI), pp. 192-197, October 2020
DOI: 10.23919/EECSI50503.2020.9251870
Citations: 1
Abstract
Identification of the spoken word and the speaker has featured in many kinds of research. A persistent obstacle is variability in the pronunciation of a particular word, together with noise, which makes words difficult to identify. Furthermore, every person has different pronunciation habits, influenced by several variables such as amplitude, frequency, tempo, and rhythm. This study proposes identifying spoken sounds from specific word input to determine the patterns of both the speaker and the spoken word, using Mel-Frequency Cepstral Coefficients (MFCC) and multiple Recurrent Neural Networks (RNN). The Mel coefficients from MFCC are used as input features for identifying spoken words and speakers with RNNs and Long Short-Term Memory (LSTM). The multiple-RNN architecture processes the spoken word and the speaker in parallel. The multiple RNN achieved an accuracy of 87.74% on new data, compared with 80.58% for a single RNN, both trained with the Adam optimizer. To verify that the model performs consistently, the experiment also applied K-fold cross-validation to the spoken-word and speaker datasets, obtaining an average accuracy of 86.07%, which indicates that the model can learn from the dataset without being affected by the order or selection of the test data.
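To make the described pipeline concrete, the following is a minimal sketch of MFCC extraction feeding two parallel LSTM branches, one classifying the spoken word and one the speaker. The library choices (librosa, tf.keras) and all hyperparameters (13 MFCC coefficients, 100 frames, 64 LSTM units, 16 kHz sampling) are illustrative assumptions, not details taken from the paper.

```python
# Sketch of MFCC features + parallel LSTM heads for word and speaker
# recognition, as described in the abstract. All hyperparameters here
# are assumptions for illustration, not the authors' settings.
import librosa
import numpy as np
import tensorflow as tf

def extract_mfcc(path, n_mfcc=13, max_frames=100):
    """Load an audio file and return a fixed-size (max_frames, n_mfcc) MFCC matrix."""
    y, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T  # (frames, n_mfcc)
    # Pad or truncate so every utterance has the same number of frames.
    if mfcc.shape[0] < max_frames:
        mfcc = np.pad(mfcc, ((0, max_frames - mfcc.shape[0]), (0, 0)))
    return mfcc[:max_frames]

def build_multi_rnn(n_words, n_speakers, max_frames=100, n_mfcc=13):
    """Shared MFCC input with two parallel LSTM branches (word and speaker)."""
    inputs = tf.keras.Input(shape=(max_frames, n_mfcc))
    # Word branch: one LSTM followed by a softmax over the word vocabulary.
    word_h = tf.keras.layers.LSTM(64)(inputs)
    word_out = tf.keras.layers.Dense(n_words, activation="softmax", name="word")(word_h)
    # Speaker branch: a separate LSTM and softmax over speaker identities.
    spk_h = tf.keras.layers.LSTM(64)(inputs)
    spk_out = tf.keras.layers.Dense(n_speakers, activation="softmax", name="speaker")(spk_h)
    model = tf.keras.Model(inputs, [word_out, spk_out])
    model.compile(optimizer="adam",  # the abstract reports results with Adam
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

Training such a model with `model.fit(x, {"word": word_labels, "speaker": speaker_labels})` optimizes both heads jointly, which is one plausible reading of the parallel word-and-speaker recognition the abstract describes.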