基于自监督学习的吵闹学生教师对儿童ASR的培训

Shreya S. Chaturvedi, Hardik B. Sailor, H. Patil
{"title":"基于自监督学习的吵闹学生教师对儿童ASR的培训","authors":"Shreya S. Chaturvedi, Hardik B. Sailor, H. Patil","doi":"10.1109/SPCOM55316.2022.9840763","DOIUrl":null,"url":null,"abstract":"Automatic Speech Recognition (ASR) is a fast-growing field, where reliable systems are made for high resource languages and for adult’s speech. However, performance of such ASR system is inefficient for children speech, due to numerous acoustic variability in children speech and scarcity of resources. In this paper, we propose to use the unlabeled data extensively to develop ASR system for low resourced children speech. State-of-the-art wav2vec 2.0 is the baseline ASR technique used here. The baseline’s performance is further enhanced with the intuition of Noisy Student Teacher (NST) learning. The proposed technique is not only limited to introducing the use of soft labels (i.e., word-level transcription) of unlabeled data, but also adapts the learning of teacher model or preceding student model, which results in reduction of the redundant training significantly. To that effect, a detailed analysis is reported in this paper, as there is a difference in teacher and student learning. In ASR experiments, character-level tokenization was used and hence, Connectionist Temporal Classification (CTC) loss was used for fine-tuning. Due to computational limitations, experiments are performed with approximately 12 hours of training, and 5 hours of development and test data was used from standard My Science Tutor (MyST) corpus. The baseline wav2vec 2.0 achieves 34% WER, while relatively 10% of performance was improved using the proposed approach. Further, the analysis of performance loss and effect of language model is discussed in details.","PeriodicalId":246982,"journal":{"name":"2022 IEEE International Conference on Signal Processing and Communications (SPCOM)","volume":"28 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Noisy Student Teacher Training with Self Supervised Learning for Children ASR\",\"authors\":\"Shreya S. Chaturvedi, Hardik B. Sailor, H. Patil\",\"doi\":\"10.1109/SPCOM55316.2022.9840763\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Automatic Speech Recognition (ASR) is a fast-growing field, where reliable systems are made for high resource languages and for adult’s speech. However, performance of such ASR system is inefficient for children speech, due to numerous acoustic variability in children speech and scarcity of resources. In this paper, we propose to use the unlabeled data extensively to develop ASR system for low resourced children speech. State-of-the-art wav2vec 2.0 is the baseline ASR technique used here. The baseline’s performance is further enhanced with the intuition of Noisy Student Teacher (NST) learning. The proposed technique is not only limited to introducing the use of soft labels (i.e., word-level transcription) of unlabeled data, but also adapts the learning of teacher model or preceding student model, which results in reduction of the redundant training significantly. To that effect, a detailed analysis is reported in this paper, as there is a difference in teacher and student learning. In ASR experiments, character-level tokenization was used and hence, Connectionist Temporal Classification (CTC) loss was used for fine-tuning. Due to computational limitations, experiments are performed with approximately 12 hours of training, and 5 hours of development and test data was used from standard My Science Tutor (MyST) corpus. The baseline wav2vec 2.0 achieves 34% WER, while relatively 10% of performance was improved using the proposed approach. Further, the analysis of performance loss and effect of language model is discussed in details.\",\"PeriodicalId\":246982,\"journal\":{\"name\":\"2022 IEEE International Conference on Signal Processing and Communications (SPCOM)\",\"volume\":\"28 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-07-11\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2022 IEEE International Conference on Signal Processing and Communications (SPCOM)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/SPCOM55316.2022.9840763\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 IEEE International Conference on Signal Processing and Communications (SPCOM)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/SPCOM55316.2022.9840763","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

摘要

自动语音识别(ASR)是一个快速发展的领域,它为高资源语言和成人语音建立了可靠的系统。然而,由于儿童语音存在大量的声学变异性和资源的稀缺性,这种ASR系统对儿童语音的表现效率较低。在本文中,我们建议广泛使用未标记的数据来开发低资源儿童语音的ASR系统。最先进的wav2vec 2.0是这里使用的基线ASR技术。噪声学生教师(NST)学习的直觉进一步增强了基线的性能。所提出的技术不仅局限于引入对未标记数据的软标签(即词级转录)的使用,而且还适应了教师模型或前学生模型的学习,从而显著减少了冗余训练。鉴于教师和学生的学习存在差异,本文对此进行了详细的分析。在ASR实验中,使用了字符级标记化,因此,使用连接时间分类(CTC)损失进行微调。由于计算的限制,实验进行了大约12小时的训练,5小时的开发和测试数据来自标准的My Science Tutor (MyST)语料库。基线wav2vec 2.0实现了34%的WER,而使用所提出的方法提高了相对10%的性能。在此基础上,详细分析了语言模型的性能损失和影响。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
Noisy Student Teacher Training with Self Supervised Learning for Children ASR
Automatic Speech Recognition (ASR) is a fast-growing field, where reliable systems are made for high resource languages and for adult’s speech. However, performance of such ASR system is inefficient for children speech, due to numerous acoustic variability in children speech and scarcity of resources. In this paper, we propose to use the unlabeled data extensively to develop ASR system for low resourced children speech. State-of-the-art wav2vec 2.0 is the baseline ASR technique used here. The baseline’s performance is further enhanced with the intuition of Noisy Student Teacher (NST) learning. The proposed technique is not only limited to introducing the use of soft labels (i.e., word-level transcription) of unlabeled data, but also adapts the learning of teacher model or preceding student model, which results in reduction of the redundant training significantly. To that effect, a detailed analysis is reported in this paper, as there is a difference in teacher and student learning. In ASR experiments, character-level tokenization was used and hence, Connectionist Temporal Classification (CTC) loss was used for fine-tuning. Due to computational limitations, experiments are performed with approximately 12 hours of training, and 5 hours of development and test data was used from standard My Science Tutor (MyST) corpus. The baseline wav2vec 2.0 achieves 34% WER, while relatively 10% of performance was improved using the proposed approach. Further, the analysis of performance loss and effect of language model is discussed in details.
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信