End-to-end neural automatic speech recognition system for low resource languages

IF 5 | Zone 3, Computer Science | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Egyptian Informatics Journal, vol. 29, Article 100615 | Pub Date: 2025-01-28 | DOI: 10.1016/j.eij.2025.100615

Sami Dhahbi, Nasir Saleem, Sami Bourouis, Mouhebeddine Berrima, Elena Verdú


Citations: 0

Abstract

End-to-end (E2E) automatic speech recognition (ASR) systems have grown popular because they learn complex speech patterns directly from raw data, eliminating the need for intricate feature extraction pipelines and handcrafted language models. E2E-ASR systems have consistently outperformed traditional ASR systems. However, training E2E-ASR systems for low-resource languages remains challenging because such systems depend on data from well-resourced languages. ASR is vital for promoting under-resourced languages, especially in developing human-to-human and human-to-machine communication systems. Synthetic speech and data augmentation can enhance E2E-ASR performance for low-resource languages, reducing word error rates (WERs) and character error rates (CERs). This study uses a non-autoregressive neural text-to-speech (TTS) engine to generate high-quality speech, converting a sequence of phonemes into speech represented as mel-spectrograms. An on-the-fly data augmentation method is applied to these mel-spectrograms, which are treated as images from which features are extracted to train an ASR model based on a convolutional neural network (CNN) and a bidirectional long short-term memory (BLSTM) network. The E2E architecture of this system achieves optimal WER and CER performance. The proposed deep learning-based E2E-ASR, trained with synthetic speech and data augmentation, reduces WER by 20.75% and CER by 10.34%.
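The two metrics reported in the abstract, WER and CER, are conventionally computed as the edit distance between a reference transcript and the recognizer output, divided by the reference length (in words for WER, in characters for CER). The following is a minimal sketch of that standard definition, not the authors' evaluation code; the function names and the example sentences are illustrative assumptions, and the abstract does not state whether the reported reductions are relative or absolute.

```python
# Illustrative sketch of the standard WER/CER definitions.
# Function names and example strings are assumptions, not taken from the paper.

def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    # prev[j] holds the distance between the first i-1 items of ref
    # and the first j items of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference item
                curr[j - 1] + 1,     # insertion of a hypothesis item
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance / reference length."""
    if not reference:
        raise ValueError("reference must contain at least one character")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    ref = "the cat sat on the mat"
    hyp = "the cat sit on mat"
    print(f"WER = {wer(ref, hyp):.3f}")  # 2 word edits over 6 reference words
    print(f"CER = {cer(ref, hyp):.3f}")  # 5 character edits over 22 reference characters
```

Because both rates are normalized by the reference length, they can exceed 1.0 when the hypothesis contains many insertions.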


Source journal

Egyptian Informatics Journal (Decision Sciences: Management Science and Operations Research)

CiteScore

11.10

Self-citation rate

1.90%

Review time

110 days

About the journal: The Egyptian Informatics Journal is published by the Faculty of Computers and Artificial Intelligence, Cairo University. The journal provides a forum for state-of-the-art research and development in the fields of computing, including computer sciences, information technologies, information systems, operations research and decision support. It encourages submissions of innovative, previously unpublished work in the subjects it covers, whether from academic, research or commercial sources.