End-to-End Low-Resource Speech Recognition with a Deep CNN-LSTM Encoder

Wen Wang, Xiaodong Yang, Hongwu Yang
{"title":"基于深度CNN-LSTM编码器的端到端低资源语音识别","authors":"Wen Wang, Xiaodong Yang, Hongwu Yang","doi":"10.1109/ICICSP50920.2020.9232119","DOIUrl":null,"url":null,"abstract":"The performance of large vocabulary continuous automatic speech recognition (ASR) has improved tremendously due to the application of deep learning. However, building a low-resource ASR system remains a challenging task due to the difficulty of data collection and the lack of linguistic knowledge in the low-resource language. In this paper, we proposed an end-to-end low-resource speech recognition method and validated using the Tibetan language as an example. We firstly designed a Tibetan text corpus, and we also recorded a Tibetan speech corpus. Then, we extracted the spectrogram as a feature of each speech. The encoder is a single structure of the deep convolutional neural network (CNN) based on the VGG network and a hybrid network of deep CNN and long short-term memory (LSTM) network. The connectionist temporal classification (CTC) network sits on top of the encoder to infer the alignments between speech and label sequences. The experimental results show that the single structure encoder achieved a word error rate of 36.85%. The hybrid structure encoder achieves a 6% word error rate reduction compared to a single structure encoder. Also, when increasing the number of CNN layers in the encoder, the word error rate is further reduced.","PeriodicalId":117760,"journal":{"name":"2020 IEEE 3rd International Conference on Information Communication and Signal Processing (ICICSP)","volume":"29 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"End-to-End Low-Resource Speech Recognition with a Deep CNN-LSTM Encoder\",\"authors\":\"Wen Wang, Xiaodong Yang, Hongwu Yang\",\"doi\":\"10.1109/ICICSP50920.2020.9232119\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The performance of large vocabulary continuous automatic speech recognition (ASR) has improved tremendously due to the application of deep learning. However, building a low-resource ASR system remains a challenging task due to the difficulty of data collection and the lack of linguistic knowledge in the low-resource language. In this paper, we proposed an end-to-end low-resource speech recognition method and validated using the Tibetan language as an example. We firstly designed a Tibetan text corpus, and we also recorded a Tibetan speech corpus. Then, we extracted the spectrogram as a feature of each speech. The encoder is a single structure of the deep convolutional neural network (CNN) based on the VGG network and a hybrid network of deep CNN and long short-term memory (LSTM) network. The connectionist temporal classification (CTC) network sits on top of the encoder to infer the alignments between speech and label sequences. The experimental results show that the single structure encoder achieved a word error rate of 36.85%. The hybrid structure encoder achieves a 6% word error rate reduction compared to a single structure encoder. 
Also, when increasing the number of CNN layers in the encoder, the word error rate is further reduced.\",\"PeriodicalId\":117760,\"journal\":{\"name\":\"2020 IEEE 3rd International Conference on Information Communication and Signal Processing (ICICSP)\",\"volume\":\"29 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2020-09-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2020 IEEE 3rd International Conference on Information Communication and Signal Processing (ICICSP)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICICSP50920.2020.9232119\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2020 IEEE 3rd International Conference on Information Communication and Signal Processing (ICICSP)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICICSP50920.2020.9232119","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 1

Abstract

The performance of large-vocabulary continuous automatic speech recognition (ASR) has improved tremendously with the application of deep learning. However, building a low-resource ASR system remains challenging because of the difficulty of data collection and the lack of linguistic knowledge for the low-resource language. In this paper, we propose an end-to-end low-resource speech recognition method and validate it using the Tibetan language as an example. We first built a Tibetan text corpus and recorded a Tibetan speech corpus, then extracted the spectrogram of each utterance as the input feature. Two encoders are compared: a single-structure deep convolutional neural network (CNN) based on the VGG network, and a hybrid network that combines the deep CNN with a long short-term memory (LSTM) network. A connectionist temporal classification (CTC) network sits on top of the encoder to infer the alignments between speech and label sequences. The experimental results show that the single-structure encoder achieves a word error rate of 36.85%, while the hybrid encoder reduces the word error rate by 6% compared to the single-structure encoder. Increasing the number of CNN layers in the encoder further reduces the word error rate.
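
To make the described architecture concrete, the following is a minimal sketch (assuming PyTorch) of a hybrid CNN-LSTM encoder topped with a CTC output layer, in the spirit of the abstract. The layer counts, channel sizes, spectrogram dimensions, and vocabulary size are illustrative placeholders, not the configuration reported in the paper.

# Minimal sketch of a hybrid CNN-LSTM encoder with a CTC output layer.
# Assumes PyTorch; all sizes below are illustrative, not the paper's setup.
import torch
import torch.nn as nn

class CNNLSTMEncoder(nn.Module):
    def __init__(self, n_freq=161, hidden=256, vocab_size=100):
        super().__init__()
        # VGG-style convolutional front end: two conv blocks, each halving
        # the time and frequency resolution via max pooling.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(128, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        feat_dim = 128 * (n_freq // 4)  # channels x reduced frequency bins
        # Bidirectional LSTM layers model longer-range temporal context.
        self.lstm = nn.LSTM(feat_dim, hidden, num_layers=2,
                            batch_first=True, bidirectional=True)
        # Linear projection to per-frame label posteriors (plus CTC blank).
        self.out = nn.Linear(2 * hidden, vocab_size + 1)

    def forward(self, spectrogram):
        # spectrogram: (batch, 1, time, n_freq)
        x = self.cnn(spectrogram)                        # (B, C, T/4, F/4)
        b, c, t, f = x.shape
        x = x.permute(0, 2, 1, 3).reshape(b, t, c * f)   # (B, T/4, C*F/4)
        x, _ = self.lstm(x)
        return self.out(x).log_softmax(dim=-1)           # (B, T/4, vocab+1)

# CTC training step on random toy data (shapes only, not real speech).
model = CNNLSTMEncoder()
ctc = nn.CTCLoss(blank=100, zero_infinity=True)
spec = torch.randn(2, 1, 200, 161)                 # batch of 2 spectrograms
log_probs = model(spec).transpose(0, 1)            # CTCLoss expects (T, B, V)
targets = torch.randint(0, 100, (2, 20))           # toy label sequences
input_lens = torch.full((2,), log_probs.size(0), dtype=torch.long)
target_lens = torch.full((2,), 20, dtype=torch.long)
loss = ctc(log_probs, targets, input_lens, target_lens)
loss.backward()

The single-structure encoder from the abstract corresponds to dropping the LSTM and feeding the CNN output directly to the linear projection; the CTC loss then infers the alignment between the downsampled frame sequence and the label sequence during training.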