End-to-End Low-Resource Speech Recognition with a Deep CNN-LSTM Encoder

Wen Wang, Xiaodong Yang, Hongwu Yang
{"title":"基于深度CNN-LSTM编码器的端到端低资源语音识别","authors":"Wen Wang, Xiaodong Yang, Hongwu Yang","doi":"10.1109/ICICSP50920.2020.9232119","DOIUrl":null,"url":null,"abstract":"The performance of large vocabulary continuous automatic speech recognition (ASR) has improved tremendously due to the application of deep learning. However, building a low-resource ASR system remains a challenging task due to the difficulty of data collection and the lack of linguistic knowledge in the low-resource language. In this paper, we proposed an end-to-end low-resource speech recognition method and validated using the Tibetan language as an example. We firstly designed a Tibetan text corpus, and we also recorded a Tibetan speech corpus. Then, we extracted the spectrogram as a feature of each speech. The encoder is a single structure of the deep convolutional neural network (CNN) based on the VGG network and a hybrid network of deep CNN and long short-term memory (LSTM) network. The connectionist temporal classification (CTC) network sits on top of the encoder to infer the alignments between speech and label sequences. The experimental results show that the single structure encoder achieved a word error rate of 36.85%. The hybrid structure encoder achieves a 6% word error rate reduction compared to a single structure encoder. Also, when increasing the number of CNN layers in the encoder, the word error rate is further reduced.","PeriodicalId":117760,"journal":{"name":"2020 IEEE 3rd International Conference on Information Communication and Signal Processing (ICICSP)","volume":"29 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"End-to-End Low-Resource Speech Recognition with a Deep CNN-LSTM Encoder\",\"authors\":\"Wen Wang, Xiaodong Yang, Hongwu Yang\",\"doi\":\"10.1109/ICICSP50920.2020.9232119\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The performance of large vocabulary continuous automatic speech recognition (ASR) has improved tremendously due to the application of deep learning. However, building a low-resource ASR system remains a challenging task due to the difficulty of data collection and the lack of linguistic knowledge in the low-resource language. In this paper, we proposed an end-to-end low-resource speech recognition method and validated using the Tibetan language as an example. We firstly designed a Tibetan text corpus, and we also recorded a Tibetan speech corpus. Then, we extracted the spectrogram as a feature of each speech. The encoder is a single structure of the deep convolutional neural network (CNN) based on the VGG network and a hybrid network of deep CNN and long short-term memory (LSTM) network. The connectionist temporal classification (CTC) network sits on top of the encoder to infer the alignments between speech and label sequences. The experimental results show that the single structure encoder achieved a word error rate of 36.85%. The hybrid structure encoder achieves a 6% word error rate reduction compared to a single structure encoder. 
Also, when increasing the number of CNN layers in the encoder, the word error rate is further reduced.\",\"PeriodicalId\":117760,\"journal\":{\"name\":\"2020 IEEE 3rd International Conference on Information Communication and Signal Processing (ICICSP)\",\"volume\":\"29 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2020-09-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2020 IEEE 3rd International Conference on Information Communication and Signal Processing (ICICSP)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICICSP50920.2020.9232119\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2020 IEEE 3rd International Conference on Information Communication and Signal Processing (ICICSP)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICICSP50920.2020.9232119","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 1

Abstract

The performance of large-vocabulary continuous automatic speech recognition (ASR) has improved tremendously with the application of deep learning. However, building a low-resource ASR system remains challenging because of the difficulty of data collection and the lack of linguistic knowledge for the low-resource language. In this paper, we propose an end-to-end low-resource speech recognition method and validate it using the Tibetan language as an example. We first built a Tibetan text corpus and recorded a Tibetan speech corpus, then extracted the spectrogram of each utterance as the input feature. Two encoders are compared: a single-structure deep convolutional neural network (CNN) based on the VGG network, and a hybrid network that combines the deep CNN with a long short-term memory (LSTM) network. A connectionist temporal classification (CTC) network sits on top of the encoder to infer the alignments between speech and label sequences. The experimental results show that the single-structure encoder achieves a word error rate of 36.85%, while the hybrid encoder reduces the word error rate by 6% compared to the single-structure encoder. Increasing the number of CNN layers in the encoder further reduces the word error rate.
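
To make the described architecture concrete, the following is a minimal sketch (assuming PyTorch) of a hybrid CNN-LSTM encoder topped with a CTC output layer, in the spirit of the abstract. The layer counts, channel sizes, spectrogram dimensions, and vocabulary size are illustrative placeholders, not the configuration reported in the paper.

# Minimal sketch of a hybrid CNN-LSTM encoder with a CTC output layer.
# Assumes PyTorch; all sizes below are illustrative, not the paper's setup.
import torch
import torch.nn as nn

class CNNLSTMEncoder(nn.Module):
    def __init__(self, n_freq=161, hidden=256, vocab_size=100):
        super().__init__()
        # VGG-style convolutional front end: two conv blocks, each halving
        # the time and frequency resolution via max pooling.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(128, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        feat_dim = 128 * (n_freq // 4)  # channels x reduced frequency bins
        # Bidirectional LSTM layers model longer-range temporal context.
        self.lstm = nn.LSTM(feat_dim, hidden, num_layers=2,
                            batch_first=True, bidirectional=True)
        # Linear projection to per-frame label posteriors (plus CTC blank).
        self.out = nn.Linear(2 * hidden, vocab_size + 1)

    def forward(self, spectrogram):
        # spectrogram: (batch, 1, time, n_freq)
        x = self.cnn(spectrogram)                        # (B, C, T/4, F/4)
        b, c, t, f = x.shape
        x = x.permute(0, 2, 1, 3).reshape(b, t, c * f)   # (B, T/4, C*F/4)
        x, _ = self.lstm(x)
        return self.out(x).log_softmax(dim=-1)           # (B, T/4, vocab+1)

# CTC training step on random toy data (shapes only, not real speech).
model = CNNLSTMEncoder()
ctc = nn.CTCLoss(blank=100, zero_infinity=True)
spec = torch.randn(2, 1, 200, 161)                 # batch of 2 spectrograms
log_probs = model(spec).transpose(0, 1)            # CTCLoss expects (T, B, V)
targets = torch.randint(0, 100, (2, 20))           # toy label sequences
input_lens = torch.full((2,), log_probs.size(0), dtype=torch.long)
target_lens = torch.full((2,), 20, dtype=torch.long)
loss = ctc(log_probs, targets, input_lens, target_lens)
loss.backward()

The single-structure encoder from the abstract corresponds to dropping the LSTM and feeding the CNN output directly to the linear projection; the CTC loss then infers the alignment between the downsampled frame sequence and the label sequence during training.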