On CNN Applied to Speech-to-Text – Comparative Analysis of Different Gradient Based Optimizers
Theodora Gaiceanu, O. Pastravanu
2021 IEEE 15th International Symposium on Applied Computational Intelligence and Informatics (SACI), 2021-05-19
DOI: https://doi.org/10.1109/SACI51354.2021.9465635
In this paper, the authors develop a Convolutional Neural Network (CNN) architecture adapted to the Speech-to-Text research field. This type of network was chosen for its capacity to extract relevant features and its popularity in classification problems. A dedicated model for a Speech-to-Text application has been designed. The parameters of the model (i.e., the size of the filters and kernels) and the number of layers were chosen experimentally, and the model that ensured the highest accuracy was selected. The model takes raw waveforms of spoken digits as input and outputs text containing the predicted digit. The network is able to provide the correct digit regardless of the gender or age of the speaker. Overfitting was avoided by using Dropout layers and an early stopping function. To select the best model, the authors took into account two basic criteria: the accuracy of the model and its execution time. In view of the computational time, a first-order cost function was chosen. The best optimizer was then selected by testing different gradient descent optimization algorithms. The application was developed in the Python programming language.
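The selection procedure described in the abstract (train with several gradient-based optimizers, stop early, then compare final quality and execution time) can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the optimizers tested, so plain gradient descent, momentum, and Adam are assumed here, and a toy two-parameter quadratic loss stands in for the CNN training loss. All function names, learning rates, and the patience value are hypothetical.

```python
import math
import time

def loss(w):
    # Toy convex objective standing in for a training loss (assumption).
    return (w[0] - 3.0) ** 2 + 10.0 * (w[1] + 1.0) ** 2

def grad(w):
    # Analytic gradient of the toy loss above.
    return [2.0 * (w[0] - 3.0), 20.0 * (w[1] + 1.0)]

def sgd(lr=0.05):
    # Plain gradient descent: w <- w - lr * g.
    def step(w, g, state):
        return [wi - lr * gi for wi, gi in zip(w, g)]
    return step

def momentum(lr=0.05, beta=0.9):
    # Heavy-ball momentum: v <- beta * v + g, then w <- w - lr * v.
    def step(w, g, state):
        v = state.setdefault("v", [0.0] * len(w))
        for i, gi in enumerate(g):
            v[i] = beta * v[i] + gi
        return [wi - lr * vi for wi, vi in zip(w, v)]
    return step

def adam(lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    # Adam with bias-corrected first and second moment estimates.
    def step(w, g, state):
        m = state.setdefault("m", [0.0] * len(w))
        v = state.setdefault("v", [0.0] * len(w))
        t = state["t"] = state.get("t", 0) + 1
        out = []
        for i, gi in enumerate(g):
            m[i] = b1 * m[i] + (1 - b1) * gi
            v[i] = b2 * v[i] + (1 - b2) * gi * gi
            m_hat = m[i] / (1 - b1 ** t)
            v_hat = v[i] / (1 - b2 ** t)
            out.append(w[i] - lr * m_hat / (math.sqrt(v_hat) + eps))
        return out
    return step

def train(step, max_steps=2000, patience=50, min_delta=1e-9):
    # Early stopping: halt once the best loss has not improved by more
    # than min_delta for `patience` consecutive steps.
    w, state = [0.0, 0.0], {}
    best, wait = loss(w), 0
    start = time.perf_counter()
    for n in range(1, max_steps + 1):
        w = step(w, grad(w), state)
        current = loss(w)
        if best - current > min_delta:
            best, wait = current, 0
        else:
            wait += 1
            if wait >= patience:
                break
    return {"loss": best, "steps": n, "seconds": time.perf_counter() - start}

# Compare the optimizers on the two criteria: final loss and run time.
results = {name: train(make()) for name, make in
           [("sgd", sgd), ("momentum", momentum), ("adam", adam)]}
for name, r in results.items():
    print(f"{name:9s} loss={r['loss']:.2e} steps={r['steps']} time={r['seconds']:.4f}s")
```

On a real speech model, the same loop would track validation accuracy rather than a closed-form loss, and the ranking of optimizers could differ from what this toy problem shows.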