Wen Wang, Xiaodong Yang, Hongwu Yang
2020 IEEE 3rd International Conference on Information Communication and Signal Processing (ICICSP), September 2020
DOI: 10.1109/ICICSP50920.2020.9232119
End-to-End Low-Resource Speech Recognition with a Deep CNN-LSTM Encoder
The performance of large-vocabulary continuous automatic speech recognition (ASR) has improved tremendously due to the application of deep learning. However, building a low-resource ASR system remains challenging because data collection is difficult and linguistic knowledge of the low-resource language is scarce. In this paper, we propose an end-to-end low-resource speech recognition method and validate it using Tibetan as an example. We first designed a Tibetan text corpus and recorded a matching Tibetan speech corpus. We then extracted the spectrogram of each utterance as its input feature. The encoder is either a single-structure deep convolutional neural network (CNN) based on the VGG network or a hybrid network combining a deep CNN with a long short-term memory (LSTM) network. A connectionist temporal classification (CTC) network sits on top of the encoder to infer the alignment between the speech and label sequences. The experimental results show that the single-structure encoder achieves a word error rate of 36.85%, while the hybrid-structure encoder reduces the word error rate by 6% relative to the single-structure encoder. Increasing the number of CNN layers in the encoder reduces the word error rate further.
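The hybrid encoder described above (VGG-style convolutional blocks feeding an LSTM, with CTC on top) can be sketched in PyTorch as follows. This is a minimal illustration, not the paper's exact configuration: the layer counts, channel widths, hidden size, and label-set size here are assumptions chosen for clarity.

```python
import torch
import torch.nn as nn

class CNNLSTMEncoder(nn.Module):
    """Hybrid encoder sketch: VGG-style CNN blocks followed by a BiLSTM,
    projecting to per-frame log-probabilities for CTC. All sizes are
    illustrative assumptions, not the paper's reported architecture."""

    def __init__(self, n_mels=80, n_labels=50, hidden=256):
        super().__init__()
        # Two VGG-style conv blocks; each MaxPool2d(2) halves both the
        # time axis and the frequency axis.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        feat = 64 * (n_mels // 4)  # channels x pooled frequency bins
        self.lstm = nn.LSTM(feat, hidden, num_layers=2,
                            batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, n_labels + 1)  # +1 for CTC blank

    def forward(self, spec):
        # spec: (batch, time, n_mels) spectrogram features
        x = spec.unsqueeze(1)              # -> (batch, 1, time, n_mels)
        x = self.cnn(x)                    # -> (batch, 64, time//4, n_mels//4)
        b, c, t, f = x.shape
        x = x.permute(0, 2, 1, 3).reshape(b, t, c * f)
        x, _ = self.lstm(x)                # -> (batch, time//4, 2*hidden)
        return self.proj(x).log_softmax(-1)

enc = CNNLSTMEncoder()
logp = enc(torch.randn(2, 100, 80))        # 2 utterances, 100 frames each
# nn.CTCLoss expects (time, batch, classes) and the post-pooling lengths.
loss = nn.CTCLoss(blank=50)(
    logp.transpose(0, 1),
    torch.randint(0, 50, (2, 10)),                 # dummy label sequences
    torch.full((2,), 25, dtype=torch.long),        # encoder output lengths
    torch.full((2,), 10, dtype=torch.long))        # label sequence lengths
```

The "single-structure" baseline from the abstract would drop the LSTM and feed the flattened CNN features directly to the projection layer; the CTC layer and loss are unchanged in either case.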