Title: On CNN Applied to Speech-to-Text – Comparative Analysis of Different Gradient Based Optimizers
Authors: Theodora Gaiceanu, O. Pastravanu
Venue: 2021 IEEE 15th International Symposium on Applied Computational Intelligence and Informatics (SACI)
Published: 2021-05-19
DOI: 10.1109/SACI51354.2021.9465635 (https://doi.org/10.1109/SACI51354.2021.9465635)
Citations: 1
Abstract
In this paper, the authors develop a Convolutional Neural Network (CNN) architecture adapted to the Speech-to-Text research field. This type of network was chosen for its capacity to extract relevant features and its popularity in classification problems. A dedicated model for a Speech-to-Text application has been designed. The parameters of the model (i.e., the size of filters and kernels) and the number of layers were chosen by conducting appropriate experiments, and the model that ensured the highest accuracy was selected. The model takes raw waveforms of spoken digits as input and outputs the predicted digit as text. The network is capable of providing the right digit regardless of the gender or age of the speaker. Overfitting is avoided by using Dropout layers and an early stopping function. To select the best model, the authors take into account two basic criteria: the accuracy of the model and its execution time. In view of the computational time, the first-order cost function was chosen. The best optimizer was then selected by testing different gradient descent optimization algorithms. The application was developed in the Python programming language.
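The selection procedure described in the abstract (run several gradient-based optimizers, then compare result quality and execution time) can be illustrated with a minimal sketch. This is not the authors' CNN or their experimental setup: a toy ill-conditioned quadratic stands in for the training loss, and the three optimizers (plain gradient descent, momentum, Adam), their hyperparameters, and all function names are assumptions chosen for illustration, since the abstract does not list which optimizers were tested.

```python
import math
import time

# Assumption: a toy ill-conditioned quadratic stands in for a training loss.
# loss(w) = 0.5 * sum(c_i * w_i^2), minimized at w = 0.
CURVATURE = [1.0, 25.0]

def loss(w):
    return 0.5 * sum(c * x * x for c, x in zip(CURVATURE, w))

def grad(w):
    return [c * x for c, x in zip(CURVATURE, w)]

def sgd(lr=0.03):
    # Plain gradient descent: w <- w - lr * g
    def step(w, g, state):
        return [x - lr * gx for x, gx in zip(w, g)]
    return step

def momentum(lr=0.03, beta=0.9):
    # Heavy-ball momentum: v <- beta * v - lr * g; w <- w + v
    def step(w, g, state):
        v = state.setdefault("v", [0.0] * len(w))
        for i, gx in enumerate(g):
            v[i] = beta * v[i] - lr * gx
        return [x + vx for x, vx in zip(w, v)]
    return step

def adam(lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    # Adam: bias-corrected first and second moment estimates of the gradient
    def step(w, g, state):
        m = state.setdefault("m", [0.0] * len(w))
        v = state.setdefault("v", [0.0] * len(w))
        t = state["t"] = state.get("t", 0) + 1
        out = []
        for i, (x, gx) in enumerate(zip(w, g)):
            m[i] = b1 * m[i] + (1 - b1) * gx
            v[i] = b2 * v[i] + (1 - b2) * gx * gx
            m_hat = m[i] / (1 - b1 ** t)
            v_hat = v[i] / (1 - b2 ** t)
            out.append(x - lr * m_hat / (math.sqrt(v_hat) + eps))
        return out
    return step

def run(step, w0=(5.0, 5.0), tol=1e-6, max_steps=5000):
    # Iterate until the loss drops below tol or the step budget is exhausted,
    # recording the two criteria: final loss and wall-clock time.
    w, state = list(w0), {}
    start = time.perf_counter()
    for n in range(1, max_steps + 1):
        w = step(w, grad(w), state)
        if loss(w) < tol:
            break
    return {"steps": n, "loss": loss(w), "seconds": time.perf_counter() - start}

results = {name: run(make()) for name, make in
           [("sgd", sgd), ("momentum", momentum), ("adam", adam)]}
for name, r in results.items():
    print(f"{name:9s} steps={r['steps']:5d} loss={r['loss']:.2e} time={r['seconds']:.4f}s")
```

In a real comparison the loss and gradient would come from the network and a validation set, and the ranking on this toy problem says nothing about which optimizer performs best on the speech task.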