Jaybrata Chakraborty, Bappaditya Chakraborty, U. Bhattacharya
{"title":"口语的密集识别","authors":"Jaybrata Chakraborty, Bappaditya Chakraborty, U. Bhattacharya","doi":"10.1109/ICPR48806.2021.9412413","DOIUrl":null,"url":null,"abstract":"In the present study, we have considered a large number (27) of Indian languages for recognition from their speech signals of different sources. A dense convolutional network architecture (DenseNet) has been used for this classification task. Dynamic elimination of low energy frames from the input speech signal has been considered as a preprocessing operation. Mel-spectrogram of pre-processed speech signal is fed as input to the DenseNet architecture. Language recognition performance of this architecture has been compared with that of several state-of-the-art deep architectures which include a convolutional neural network (CNN), ResNet, CNN-BLSTM and DenseNet-BLSTM hybrid architectures. Additionally, we obtained recognition performances of a stacked BLSTM architecture fed with different sets of handcrafted features for comparison purpose. Simulations for both speaker independent and speaker dependent scenarios have been performed on two different standard datasets which include (i) IITKGP-MLILSC dataset of news clips in 27 different Indian languages and (ii) Linguistic Data Consortium (LDC) dataset of telephonic conversations in 5 different Indian languages. In each case, recognition performance of the DenseNet architecture along with Mel-spectrogram features has been found to be significantly better than all other frameworks implemented in this study.","PeriodicalId":6783,"journal":{"name":"2020 25th International Conference on Pattern Recognition (ICPR)","volume":"78 1","pages":"9674-9681"},"PeriodicalIF":0.0000,"publicationDate":"2021-01-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":"{\"title\":\"DenseRecognition of Spoken Languages\",\"authors\":\"Jaybrata Chakraborty, Bappaditya Chakraborty, U. Bhattacharya\",\"doi\":\"10.1109/ICPR48806.2021.9412413\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In the present study, we have considered a large number (27) of Indian languages for recognition from their speech signals of different sources. A dense convolutional network architecture (DenseNet) has been used for this classification task. Dynamic elimination of low energy frames from the input speech signal has been considered as a preprocessing operation. Mel-spectrogram of pre-processed speech signal is fed as input to the DenseNet architecture. Language recognition performance of this architecture has been compared with that of several state-of-the-art deep architectures which include a convolutional neural network (CNN), ResNet, CNN-BLSTM and DenseNet-BLSTM hybrid architectures. Additionally, we obtained recognition performances of a stacked BLSTM architecture fed with different sets of handcrafted features for comparison purpose. Simulations for both speaker independent and speaker dependent scenarios have been performed on two different standard datasets which include (i) IITKGP-MLILSC dataset of news clips in 27 different Indian languages and (ii) Linguistic Data Consortium (LDC) dataset of telephonic conversations in 5 different Indian languages. In each case, recognition performance of the DenseNet architecture along with Mel-spectrogram features has been found to be significantly better than all other frameworks implemented in this study.\",\"PeriodicalId\":6783,\"journal\":{\"name\":\"2020 25th International Conference on Pattern Recognition (ICPR)\",\"volume\":\"78 1\",\"pages\":\"9674-9681\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2021-01-10\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"2\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2020 25th International Conference on Pattern Recognition (ICPR)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICPR48806.2021.9412413\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2020 25th International Conference on Pattern Recognition (ICPR)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICPR48806.2021.9412413","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
In the present study, we have considered a large number (27) of Indian languages for recognition from their speech signals of different sources. A dense convolutional network architecture (DenseNet) has been used for this classification task. Dynamic elimination of low energy frames from the input speech signal has been considered as a preprocessing operation. Mel-spectrogram of pre-processed speech signal is fed as input to the DenseNet architecture. Language recognition performance of this architecture has been compared with that of several state-of-the-art deep architectures which include a convolutional neural network (CNN), ResNet, CNN-BLSTM and DenseNet-BLSTM hybrid architectures. Additionally, we obtained recognition performances of a stacked BLSTM architecture fed with different sets of handcrafted features for comparison purpose. Simulations for both speaker independent and speaker dependent scenarios have been performed on two different standard datasets which include (i) IITKGP-MLILSC dataset of news clips in 27 different Indian languages and (ii) Linguistic Data Consortium (LDC) dataset of telephonic conversations in 5 different Indian languages. In each case, recognition performance of the DenseNet architecture along with Mel-spectrogram features has been found to be significantly better than all other frameworks implemented in this study.