Collaborative AI Dysarthric Speech Recognition System with Data Augmentation using Generative Adversarial Neural Network.

IF 4.8 · CAS Region 2 (Medicine) · Q2 ENGINEERING, BIOMEDICAL
Yibo He, Kah Phooi Seng, Li Minn Ang
{"title":"Collaborative AI Dysarthric Speech Recognition System with Data Augmentation using Generative Adversarial Neural Network.","authors":"Yibo He, Kah Phooi Seng, Li Minn Ang","doi":"10.1109/TNSRE.2025.3570383","DOIUrl":null,"url":null,"abstract":"<p><p>This paper proposes a novel collaborative dysarthric speech recognition system designed to convert dysarthric speech into non-dysarthric speech to enhance the robustness of automatic speech recognition (ASR) systems fine-tuned for dysarthric speech. The system employs an innovative three-stage data augmentation framework: The first stage collaboratively augments the training dataset by generating static data and high-quality synthetic speech samples using a natural text-to-speech model (Tacotron2). The second stage applies a tempo perturbation technique that simulates the natural variation of speech rhythms by adjusting the playback tempo to improve the model's adaptability to varying speech speeds. The third stage integrates the Inception-ResNet module with a temporal masking strategy using an enhanced CycleGAN-based conversion model to efficiently map conformal and non-conformal phonological features while preserving the overall speech structure and resolving temporal irregularities. Experiments conducted on the UASpeech corpus demonstrate a significant reduction in the word error rate (WER) compared to the baseline approach. Specifically, the three-stage data enhancement process achieves a reduction in the WER for the fine-tuned Wav2Vec2-XLSR and Whisper-Tiny models by 9.81% and 6.56%, respectively, with an average WER of 13.58% for the best performing system. 
These results highlight the effectiveness of the collaborative framework in improving the accuracy and naturalness of speech recognition for dysarthria, thereby offering individuals with dysarthria a more natural and intelligible communication experience.</p>","PeriodicalId":13419,"journal":{"name":"IEEE Transactions on Neural Systems and Rehabilitation Engineering","volume":"PP ","pages":""},"PeriodicalIF":4.8000,"publicationDate":"2025-05-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Neural Systems and Rehabilitation Engineering","FirstCategoryId":"5","ListUrlMain":"https://doi.org/10.1109/TNSRE.2025.3570383","RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"ENGINEERING, BIOMEDICAL","Score":null,"Total":0}
引用次数: 0

Abstract

This paper proposes a novel collaborative dysarthric speech recognition system designed to convert dysarthric speech into non-dysarthric speech, enhancing the robustness of automatic speech recognition (ASR) systems fine-tuned for dysarthric speech. The system employs an innovative three-stage data augmentation framework: the first stage collaboratively augments the training dataset by generating static data and high-quality synthetic speech samples using a natural text-to-speech model (Tacotron2). The second stage applies a tempo perturbation technique that simulates the natural variation of speech rhythms by adjusting the playback tempo, improving the model's adaptability to varying speech speeds. The third stage integrates the Inception-ResNet module with a temporal masking strategy in an enhanced CycleGAN-based conversion model to efficiently map conformal and non-conformal phonological features while preserving the overall speech structure and resolving temporal irregularities. Experiments conducted on the UASpeech corpus demonstrate a significant reduction in the word error rate (WER) compared to the baseline approach. Specifically, the three-stage data augmentation process reduces the WER of the fine-tuned Wav2Vec2-XLSR and Whisper-Tiny models by 9.81% and 6.56%, respectively, with an average WER of 13.58% for the best-performing system. These results highlight the effectiveness of the collaborative framework in improving the accuracy and naturalness of speech recognition for dysarthria, thereby offering individuals with dysarthria a more natural and intelligible communication experience.
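The "enhanced CycleGAN-based conversion model" in the third stage presumably builds on the standard CycleGAN objective; the abstract does not spell it out, but in the usual formulation two generators G (dysarthric → non-dysarthric) and F (non-dysarthric → dysarthric) are trained with adversarial losses plus a cycle-consistency term that preserves the overall speech structure:

```latex
\mathcal{L}_{\mathrm{cyc}}(G, F) =
  \mathbb{E}_{x \sim p_{\mathrm{dys}}}\bigl[\lVert F(G(x)) - x \rVert_1\bigr]
+ \mathbb{E}_{y \sim p_{\mathrm{typ}}}\bigl[\lVert G(F(y)) - y \rVert_1\bigr]
```

This term penalizes round-trip reconstruction error in both directions, which is what allows unpaired dysarthric and non-dysarthric utterances to be used for training.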
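The second stage's tempo perturbation can be illustrated with a minimal sketch. The abstract does not specify the implementation; the version below simply resamples the waveform by linear interpolation, which changes tempo but also shifts pitch. A pitch-preserving method (e.g. WSOLA, as in SoX's `tempo` effect) is the more likely choice in practice; the function name and rate value here are illustrative assumptions.

```python
import numpy as np

def tempo_perturb(samples: np.ndarray, rate: float) -> np.ndarray:
    """Resample a waveform to simulate a tempo change.

    rate > 1.0 speeds playback up (shorter output); rate < 1.0 slows it
    down. Plain resampling also shifts pitch, so this is only a sketch of
    the idea, not a pitch-preserving tempo effect.
    """
    n_out = int(round(len(samples) / rate))
    # Fractional positions in the input signal for each output sample.
    positions = np.linspace(0, len(samples) - 1, num=n_out)
    return np.interp(positions, np.arange(len(samples)), samples)

# A 1-second 440 Hz tone at 16 kHz, played back 25% faster.
sr = 16000
tone = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)
fast = tempo_perturb(tone, rate=1.25)
```

Applying several rates (e.g. 0.9, 1.0, 1.1) to each training utterance multiplies the effective dataset size, which is the point of this augmentation stage.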
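The WER figures reported above follow the standard definition: the word-level Levenshtein distance between reference and hypothesis transcripts, divided by the number of reference words. A minimal implementation:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length,
    computed via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution/match
    return dp[len(ref)][len(hyp)] / len(ref)

# One deleted word out of six reference words -> WER of 1/6.
wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
```

Libraries such as `jiwer` provide the same metric with normalization options; the sentences above are illustrative, not from the paper's test set.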

Source journal metrics
CiteScore: 8.60
Self-citation rate: 8.20%
Articles per year: 479
Review time: 6-12 weeks
Journal scope: Rehabilitative and neural aspects of biomedical engineering, including functional electrical stimulation, acoustic dynamics, human performance measurement and analysis, nerve stimulation, electromyography, motor control and stimulation; and hardware and software applications for rehabilitation engineering and assistive devices.