DenseRecognition of Spoken Languages

2020 25th International Conference on Pattern Recognition (ICPR). Pub Date: 2021-01-10. DOI: 10.1109/ICPR48806.2021.9412413

Jaybrata Chakraborty, Bappaditya Chakraborty, U. Bhattacharya


Citations: 2

Abstract


DenseRecognition of Spoken Languages

This study considers a large number (27) of Indian languages for recognition from speech signals drawn from different sources. A dense convolutional network architecture (DenseNet) is used for this classification task. As a preprocessing step, low-energy frames are dynamically eliminated from the input speech signal, and the Mel-spectrogram of the pre-processed signal is fed as input to the DenseNet architecture. The language recognition performance of this architecture is compared with that of several state-of-the-art deep architectures, including a convolutional neural network (CNN), ResNet, and CNN-BLSTM and DenseNet-BLSTM hybrid architectures. For further comparison, recognition performance was also obtained for a stacked BLSTM architecture fed with different sets of handcrafted features. Simulations for both speaker-independent and speaker-dependent scenarios were performed on two standard datasets: (i) the IITKGP-MLILSC dataset of news clips in 27 Indian languages and (ii) the Linguistic Data Consortium (LDC) dataset of telephonic conversations in 5 Indian languages. In each case, the DenseNet architecture with Mel-spectrogram features performed significantly better than all other frameworks implemented in this study.
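The preprocessing chain named in the abstract (drop low-energy frames, then compute a Mel-spectrogram) can be illustrated with a minimal NumPy sketch. The abstract does not give the elimination criterion or any parameter values, so everything specific here is an assumption made for illustration: the function names, the 25 ms / 10 ms framing at 16 kHz, the rule "keep frames whose energy exceeds a fraction of the utterance's mean frame energy", and the 40 Mel bands. It should not be read as the authors' implementation.

```python
import numpy as np

def frame_signal(signal, frame_len=400, hop=160):
    """Split a 1-D signal into overlapping frames, one frame per row."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return signal[idx]

def drop_low_energy_frames(frames, ratio=0.1):
    """Keep frames whose energy exceeds ratio * mean frame energy.

    The threshold adapts to each utterance. This rule is an assumption,
    not the criterion used in the paper.
    """
    energy = np.sum(frames ** 2, axis=1)
    return frames[energy > ratio * energy.mean()]

def mel_filterbank(n_mels, n_fft, sr):
    """Triangular filters spaced evenly on the Mel scale (HTK formula)."""
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    hz_pts = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2))
    fft_freqs = np.linspace(0.0, sr / 2, n_fft // 2 + 1)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        left, center, right = hz_pts[i], hz_pts[i + 1], hz_pts[i + 2]
        rising = (fft_freqs - left) / (center - left)
        falling = (right - fft_freqs) / (right - center)
        fb[i] = np.maximum(0.0, np.minimum(rising, falling))
    return fb

def log_mel_spectrogram(frames, sr=16000, n_fft=512, n_mels=40):
    """Log-Mel energies per frame, shape (n_frames, n_mels)."""
    windowed = frames * np.hanning(frames.shape[1])
    power = np.abs(np.fft.rfft(windowed, n=n_fft, axis=1)) ** 2
    mel = power @ mel_filterbank(n_mels, n_fft, sr).T
    return 10.0 * np.log10(mel + 1e-10)  # small offset avoids log(0)

# Example: one second of a 440 Hz tone followed by one second of silence.
sr = 16000
t = np.arange(sr) / sr
signal = np.concatenate([np.sin(2 * np.pi * 440 * t), np.zeros(sr)])
frames = frame_signal(signal)
kept = drop_low_energy_frames(frames)
features = log_mel_spectrogram(kept, sr=sr)
```

In this example the silent second is discarded before feature extraction, so `features` has fewer rows than `frames`. The resulting array is the kind of two-dimensional input a convolutional classifier such as DenseNet would consume.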
