Convolution-Augmented Transformers for Enhanced Speaker-Independent Dysarthric Speech Recognition

Impact factor 5.2; Zone 2 (Medicine); JCR Q2 (Engineering, Biomedical)
Zihan Zhong;Qianli Wang;Satwinder Singh;Clarion C. Mendes;Mark Hasegawa-Johnson;Waleed Abdulla;Seyed Reza Shahamiri
DOI: 10.1109/TNSRE.2025.3610792
Journal: IEEE Transactions on Neural Systems and Rehabilitation Engineering, vol. 33, pp. 3815-3826
Published: 2025-09-17
Publication type: Journal Article
Open access: No
URL: https://ieeexplore.ieee.org/document/11168953/
PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11168953
Citations: 0

Abstract

Dysarthria is a motor speech disorder characterized by muscle movement difficulties that complicate verbal communication. It poses significant challenges for Automatic Speech Recognition (ASR) systems because of data scarcity and speaker variability among dysarthric individuals. This study investigates speaker-independent (SI) approaches to assist speakers with communication impairments. First, we developed dysarthric SI models using a Conformer-based system and a three-stage transfer-learning pipeline that employs a selective layer-freezing parameter-efficient fine-tuning (PEFT) strategy to mitigate data scarcity. We pre-trained on standard speech and then progressively adapted the models to each of two dysarthric datasets. Second, we introduced a benchmark framework that evaluates the generalizability of SI models through cross-dataset validation, a previously unexplored approach in dysarthric ASR that provides a more realistic scenario. The results demonstrate that the proposed dysarthric SI models outperform all baseline systems. Specifically, on the TORGO dataset, our models improved word recognition accuracy by 21.9% for isolated speech and reduced the word error rate by 18.5% for continuous speech. On UA-Speech, our best dysarthric SI model improved word recognition by 14.6% over Whisper and by 28.3% over the base model for isolated speech. Nevertheless, cross-dataset testing showed that the models tended to produce isolated words when asked to transcribe continuous speech from speakers with severe dysarthria, highlighting the need to further improve SI generalization.
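
The abstract reports results with two metrics: word recognition accuracy for isolated words and word error rate (WER) for continuous speech. The sketch below shows how these are commonly computed. It assumes the standard definitions (exact-match accuracy per utterance, and word-level edit distance divided by reference length) and simple lowercase, whitespace-based tokenization; the abstract does not describe the authors' scoring procedure, and all function names here are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two lists of words."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hyp) / len(ref)


def word_recognition_accuracy(references, hypotheses):
    """Fraction of isolated-word utterances transcribed exactly (assumed definition)."""
    pairs = list(zip(references, hypotheses))
    if not pairs:
        raise ValueError("need at least one utterance")
    correct = sum(r.strip().lower() == h.strip().lower() for r, h in pairs)
    return correct / len(pairs)


if __name__ == "__main__":
    # Made-up sentences, not data from the paper.
    print(word_error_rate("the cat sat on the mat", "the cat sat mat"))
    # A single-word output for a continuous utterance is heavily penalized.
    print(word_error_rate("the cat sat on the mat", "mat"))
    print(word_recognition_accuracy(["yes", "no", "stop"], ["yes", "go", "stop"]))
```

Under these definitions, a model that emits one isolated word for a six-word utterance is charged five deletions, which is why the cross-dataset failure mode described in the abstract would show up as a high WER on continuous speech.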
Source journal
CiteScore: 8.60
Self-citation rate: 8.20%
Articles published: 479
Review time: 6-12 weeks
Journal scope: Rehabilitative and neural aspects of biomedical engineering, including functional electrical stimulation, acoustic dynamics, human performance measurement and analysis, nerve stimulation, electromyography, motor control and stimulation; and hardware and software applications for rehabilitation engineering and assistive devices.