Improving the Performance of Zero-Resource Children’s ASR System through Formant and Duration Modification based Data Augmentation

S. Shahnawazuddin, Vinit Kumar, Avinash Kumar, Waquar Ahmad
{"title":"Improving the Performance of Zero-Resource Children’s ASR System through Formant and Duration Modification based Data Augmentation","authors":"S. Shahnawazuddin, Vinit Kumar, Avinash Kumar, Waquar Ahmad","doi":"10.1109/SPCOM55316.2022.9840767","DOIUrl":null,"url":null,"abstract":"Developing an automatic speech recognition (ASR) system for children’s speech is extremely challenging due to the unavailability of data from the child domain for the majority of the languages. Consequently, in such zero-resource scenarios, we are forced to develop an ASR system using adults’ speech for transcribing data from child speakers. However, differences in formant frequencies and speaking-rate between the two groups of speakers degrade recognition performance. To reduce the said mismatch, out-of-domain data augmentation approaches based on formant and duration modification are proposed in this work. For that purpose, formant frequencies of adults’ speech training data are up-scaled using warping of linear predictive coding coefficients. Next, the speaking-rate of adults’ data is also increased through time-scale modification. Due to simultaneous altering of formant frequencies and duration of adults’ speech and then pooling the modified data into training, the acoustic mismatch due to the aforementioned factors gets reduced. This, in turn, enhances the recognition performance significantly. Additional improvement is obtained by combining the recently reported voice-conversion-based data augmentation technique with the proposed ones. On combining the proposed and voice-conversion-based data augmentation techniques, a relative reduction of nearly 32.3% in word error rate over the baseline is obtained.","PeriodicalId":246982,"journal":{"name":"2022 IEEE International Conference on Signal Processing and Communications (SPCOM)","volume":"37 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 IEEE International Conference on Signal Processing and Communications (SPCOM)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/SPCOM55316.2022.9840767","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Developing an automatic speech recognition (ASR) system for children’s speech is extremely challenging due to the unavailability of child-domain data for the majority of languages. Consequently, in such zero-resource scenarios, we are forced to develop ASR systems using adults’ speech to transcribe data from child speakers. However, differences in formant frequencies and speaking rate between the two groups of speakers degrade recognition performance. To reduce this mismatch, out-of-domain data augmentation approaches based on formant and duration modification are proposed in this work. For that purpose, the formant frequencies of the adults’ speech training data are up-scaled by warping the linear predictive coding (LPC) coefficients. Next, the speaking rate of the adults’ data is also increased through time-scale modification. Simultaneously altering the formant frequencies and duration of the adults’ speech and then pooling the modified data into training reduces the acoustic mismatch due to the aforementioned factors. This, in turn, enhances recognition performance significantly. Additional improvement is obtained by combining the recently reported voice-conversion-based data augmentation technique with the proposed ones. On combining the proposed and voice-conversion-based data augmentation techniques, a relative reduction of nearly 32.3% in word error rate over the baseline is obtained.
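The two augmentation steps outlined in the abstract can be sketched as follows. This is a minimal illustration, assuming frame-based warping of LPC pole angles for the formant up-scaling and librosa's time-stretching for the speaking-rate increase; the warp factor, stretch factor, LPC order, frame settings, and file names are illustrative assumptions and are not taken from the paper.

```python
# Minimal sketch: formant up-scaling via LPC pole-angle warping, followed by
# time-scale modification to increase speaking rate. Parameter values are
# illustrative assumptions, not the paper's settings.
import numpy as np
import librosa
import soundfile as sf
from scipy.signal import lfilter


def warp_formants_lpc(y, alpha=1.15, order=16, frame_len=1024, hop=256):
    """Up-scale formant frequencies by scaling the angles of the LPC poles by `alpha`."""
    out = np.zeros(len(y))
    norm = np.zeros(len(y))
    win = np.hanning(frame_len)
    for start in range(0, len(y) - frame_len, hop):
        frame = y[start:start + frame_len] * win
        if not np.any(frame):                          # skip silent frames
            continue
        a = librosa.lpc(frame, order=order)            # prediction-error filter A(z)
        residual = lfilter(a, [1.0], frame)            # inverse filtering -> excitation
        poles = np.roots(a)
        # Move each complex pole to a higher angle (formant frequency), keep its radius.
        warped = [abs(p) * np.exp(1j * np.clip(np.angle(p) * alpha, -np.pi, np.pi))
                  if np.iscomplex(p) else p for p in poles]
        a_warped = np.real(np.poly(warped))            # warped all-pole envelope
        out[start:start + frame_len] += lfilter([1.0], a_warped, residual) * win
        norm[start:start + frame_len] += win ** 2
    norm[norm == 0.0] = 1.0
    return out / norm                                  # overlap-add with window normalisation


def increase_speaking_rate(y, factor=1.2):
    """Time-scale modification: factor > 1 shortens the signal (faster speech)."""
    return librosa.effects.time_stretch(y, rate=factor)


if __name__ == "__main__":
    # "adult_utterance.wav" is a hypothetical adults' training utterance.
    y, sr = librosa.load("adult_utterance.wav", sr=16000)
    y_aug = increase_speaking_rate(warp_formants_lpc(y), factor=1.2)
    sf.write("adult_utterance_aug.wav", y_aug, sr)     # modified copy pooled into training
```

In the scheme described by the abstract, such modified copies of the adults' utterances would be pooled with the original training data so that the acoustic model sees speech closer to children's formant frequencies and speaking rate.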