Collaborative Training of Acoustic Encoder for Recognizing the Impaired Children Speech
Authors: S. Shareef, Yusra Faisal Mohammed
Venue: 2022 Fifth College of Science International Conference of Recent Trends in Information Technology (CSCTIT)
Publication date: 2022-11-15
DOI: 10.1109/CSCTIT56299.2022.10145742 (https://doi.org/10.1109/CSCTIT56299.2022.10145742)
Citations: 0
Abstract
Encoder-decoder models have become an effective and increasingly popular approach to sequence learning tasks such as automatic speech recognition (ASR) because they simplify the processing pipeline. The traditional decoder-based approach usually learns a sequence-to-sequence mapping from the source speech to target units (characters, words, or phonemes). However, its performance degrades when the training data are insufficient or defective. In impaired children's speech recognition, impaired pronunciation leaves the audio signal with sparse and disordered information, which must be processed at several levels to improve the model's ability. This work therefore adopts an encoder structure intended to improve the encoder's ability to encode the input sequence into a sequence of hidden representations. It proposes collaborative training of sequence predictive models as an acoustic encoder for ASR of Arabic impaired children's speech. Each sequence model maps the input sequence to only one phoneme of the output sequence; the encoder then aligns the output phonemes into a single output sequence. Experimental results on the Impaired Children's Speech of Arabic dataset show that the collaboratively trained acoustic encoder, which aligns the phoneme sequence, provides up to a 10% relative improvement in accuracy over traditional methods. This paper is also the first in the area of recognizing impaired children's speech at the level of Arabic text.
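The abstract does not specify the sub-model architecture, the alignment procedure, or the training objective, so the following is only a toy sketch of the general idea, not the authors' implementation. It assumes one small classifier per output position, each mapping the whole input to a single phoneme, with the per-position predictions concatenated in order and one summed cross-entropy loss shared by all sub-models. Every name here (`PositionModel`, `encode`, `joint_loss`, the three-phoneme inventory, the hand-set weights) is hypothetical.

```python
import math
from typing import List, Sequence

PHONEMES = ["b", "a", "t"]  # hypothetical toy inventory, not from the paper


def mean_pool(frames: Sequence[Sequence[float]]) -> List[float]:
    """Average the feature frames over time into one fixed-size vector."""
    n = len(frames)
    return [sum(col) / n for col in zip(*frames)]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class PositionModel:
    """Assumed sub-model: maps the whole input to ONE phoneme distribution."""

    def __init__(self, weights: List[List[float]]):
        self.weights = weights  # one weight row per phoneme

    def predict_proba(self, frames: Sequence[Sequence[float]]) -> List[float]:
        pooled = mean_pool(frames)
        scores = [sum(w * x for w, x in zip(row, pooled)) for row in self.weights]
        return softmax(scores)


def encode(models: List[PositionModel], frames) -> List[str]:
    """Place each sub-model's top phoneme at its position in one output sequence."""
    output = []
    for model in models:
        probs = model.predict_proba(frames)
        output.append(PHONEMES[probs.index(max(probs))])
    return output


def joint_loss(models: List[PositionModel], frames, target: List[str]) -> float:
    """Sum of per-position cross-entropies: one objective shared by all sub-models."""
    loss = 0.0
    for model, phoneme in zip(models, target):
        probs = model.predict_proba(frames)
        loss -= math.log(probs[PHONEMES.index(phoneme)])
    return loss


# Hand-set weights so that position 0 favors "b", position 1 "a", position 2 "t".
models = [
    PositionModel([[2.0, 0.0], [0.0, 0.0], [0.0, 0.0]]),
    PositionModel([[0.0, 0.0], [2.0, 0.0], [0.0, 0.0]]),
    PositionModel([[0.0, 0.0], [0.0, 0.0], [2.0, 0.0]]),
]
frames = [[1.0, 0.0], [1.0, 0.0]]  # two toy feature frames

print(encode(models, frames))  # ['b', 'a', 't']
print(joint_loss(models, frames, ["b", "a", "t"]))
```

In a trainable version, minimizing `joint_loss` would update all sub-models against the same target sequence, which is one plausible reading of "collaborative training"; the paper itself should be consulted for the actual procedure.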