Exploring Model Units and Training Strategies for End-to-End Speech Recognition

2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). Pub Date: 2019-12-01. DOI: 10.1109/ASRU46091.2019.9003834

Mingkun Huang, Yizhou Lu, Lan Wang, Y. Qian, Kai Yu

Citations: 8

Abstract

In this work, we explore end-to-end speech recognition models (CTC, RNN-Transducer and attention-based models) with different model units (character, wordpiece and word) and various training strategies. We show that wordpiece units outperform character units for all end-to-end systems on the Switchboard Hub5'00 benchmark. To improve the performance of end-to-end systems, we propose a multi-stage pretraining strategy, which gives 25.0% and 18.0% relative improvements over training from scratch for attention and RNN-T models, respectively, with wordpiece units. We achieve state-of-the-art performance on the Switchboard+Fisher-2000h task, outperforming all prior work. Together with other training strategies such as label smoothing and data augmentation, we achieve 5.9%/12.1% WER on the Switchboard/CallHome test sets without using any external language models. This is a new performance milestone for a single end-to-end system, and it is also much better than the previously published best hybrid system, which reaches 6.7%/12.5% on the same two sets.
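The abstract reports results as word error rate (WER) and as relative improvements. As a reading aid, the minimal sketch below shows the standard textbook definitions of both quantities. It is not the authors' scoring pipeline, which the abstract does not describe; the function names `word_error_rate` and `relative_improvement` and the toy sentences are hypothetical. The only figures taken from the abstract are the 5.9%/12.1% and 6.7%/12.5% WERs.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def relative_improvement(baseline_wer: float, new_wer: float) -> float:
    """Relative WER reduction of new_wer with respect to baseline_wer."""
    return (baseline_wer - new_wer) / baseline_wer


# Toy example (hypothetical sentences): one substitution and one deletion
# against a six-word reference.
toy_wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")

# Relative gain of the reported end-to-end system over the cited hybrid
# system, using only the WER figures quoted in the abstract.
switchboard_gain = relative_improvement(6.7, 5.9)
callhome_gain = relative_improvement(12.5, 12.1)
```

Under these definitions, the quoted figures correspond to a relative WER reduction of roughly 12% on Switchboard and roughly 3% on CallHome over the hybrid system. The 25.0% and 18.0% pretraining gains cannot be re-derived this way, because the abstract does not give the underlying baseline WERs.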
