Leveraging Text Data Using Hybrid Transformer-LSTM Based End-to-End ASR in Transfer Learning

2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP). Pub Date: 2020-05-21. DOI: 10.1109/ISCSLP49672.2021.9362086

Zhiping Zeng, V. T. Pham, Haihua Xu, Yerbolat Khassanov, Chng Eng Siong, Chongjia Ni, B. Ma


Citations: 9

Abstract

In this work, we study how to leverage extra text data to improve low-resource end-to-end ASR in a cross-lingual transfer learning setting. To this end, we extend prior work [1] and propose a hybrid Transformer-LSTM architecture. This architecture not only takes advantage of the highly effective encoding capacity of the Transformer network but also benefits from extra text data thanks to its LSTM-based independent language model network. We conduct experiments on our in-house Malay corpus, which contains limited labeled data and a large amount of extra text. Results show that the proposed architecture outperforms the previous LSTM-based architecture [1] by 24.2% relative word error rate (WER) when both are trained on the limited labeled data. Starting from this, we obtain a further 25.4% relative WER reduction by transfer learning from another resource-rich language. Moreover, we obtain an additional 13.6% relative WER reduction by boosting the LSTM decoder of the transferred model with the extra text data. Overall, our best model outperforms the vanilla Transformer ASR by 11.9% relative WER. Last but not least, the proposed hybrid architecture offers much faster inference than both the LSTM and Transformer architectures.
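The abstract reports its gains as relative WER reductions. As a minimal sketch of how such figures are commonly computed, the Python below implements word-level WER via edit distance and the relative reduction between two systems. The function names and the sample sentences are illustrative assumptions, not taken from the paper, and the compounding helper assumes each reported reduction is measured against the immediately preceding system, which the abstract suggests ("starting from this", "additional") but does not state outright.

```python
from typing import Iterable


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level WER: edit distance between word sequences / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, improved_wer: float) -> float:
    """Fraction of the baseline WER removed by the improved system."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - improved_wer) / baseline_wer


def compound_relative_reduction(reductions: Iterable[float]) -> float:
    """Overall relative reduction when each step is relative to the previous one."""
    remaining = 1.0
    for reduction in reductions:
        remaining *= 1.0 - reduction
    return 1.0 - remaining


if __name__ == "__main__":
    # Hypothetical example sentences, not data from the paper.
    print(word_error_rate("saya suka makan nasi", "saya mahu makan"))
    # Hypothetical WER values (in percent) for two systems.
    print(relative_wer_reduction(20.0, 15.0))
```

Note that a relative reduction is a fraction of the baseline WER, not a difference in absolute percentage points, so the figures quoted in the abstract cannot be converted to absolute WER values without the baseline numbers from the paper itself.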

