Collaborative AI Dysarthric Speech Recognition System with Data Augmentation using Generative Adversarial Neural Network
Authors: Yibo He, Kah Phooi Seng, Li Minn Ang
Journal: IEEE Transactions on Neural Systems and Rehabilitation Engineering (Impact Factor 4.8; JCR Q2, Engineering, Biomedical)
Publication date: 2025-05-15
Publication type: Journal Article
DOI: 10.1109/TNSRE.2025.3570383 (https://doi.org/10.1109/TNSRE.2025.3570383)
Citation count: 0
Abstract
This paper proposes a novel collaborative dysarthric speech recognition system that converts dysarthric speech into non-dysarthric speech to enhance the robustness of automatic speech recognition (ASR) systems fine-tuned for dysarthric speech. The system employs an innovative three-stage data augmentation framework. The first stage collaboratively augments the training dataset by generating static data and high-quality synthetic speech samples with a natural text-to-speech model (Tacotron2). The second stage applies a tempo perturbation technique that simulates the natural variation of speech rhythms by adjusting the playback tempo, improving the model's adaptability to varying speech speeds. The third stage integrates the Inception-ResNet module with a temporal masking strategy, using an enhanced CycleGAN-based conversion model to efficiently map conformal and non-conformal phonological features while preserving the overall speech structure and resolving temporal irregularities. Experiments on the UASpeech corpus demonstrate a significant reduction in word error rate (WER) compared to the baseline approach. Specifically, the three-stage data augmentation process reduces the WER of the fine-tuned Wav2Vec2-XLSR and Whisper-Tiny models by 9.81% and 6.56%, respectively, with an average WER of 13.58% for the best-performing system. These results highlight the effectiveness of the collaborative framework in improving the accuracy and naturalness of speech recognition for dysarthria, thereby offering individuals with dysarthria a more natural and intelligible communication experience.
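The results above are reported as word error rate (WER). As a minimal sketch of the standard definition (word-level edit distance divided by the number of reference words), and not the authors' evaluation code, WER could be computed as follows. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration; the abstract does not describe the scoring pipeline or any text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER = (substitutions + deletions + insertions) / reference length.

    Assumption: words are separated by whitespace and compared exactly
    (no case folding or punctuation stripping).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution ("sat" -> "sit") and one deletion ("the") over 6 words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because the denominator is the reference length, WER can exceed 1.0 when the hypothesis contains many inserted words.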
About the journal:
Rehabilitative and neural aspects of biomedical engineering, including functional electrical stimulation, acoustic dynamics, human performance measurement and analysis, nerve stimulation, electromyography, motor control and stimulation; and hardware and software applications for rehabilitation engineering and assistive devices.