Zihan Zhong;Qianli Wang;Satwinder Singh;Clarion C. Mendes;Mark Hasegawa-Johnson;Waleed Abdulla;Seyed Reza Shahamiri
Journal: IEEE Transactions on Neural Systems and Rehabilitation Engineering, vol. 33, pp. 3815-3826 (Journal Article)
Published: 2025-09-17
DOI: 10.1109/TNSRE.2025.3610792
URL: https://ieeexplore.ieee.org/document/11168953/
Impact factor: 5.2; JCR: Q2 (Engineering, Biomedical)
Convolution-Augmented Transformers for Enhanced Speaker-Independent Dysarthric Speech Recognition
Dysarthria is a motor speech disorder characterized by muscle movement difficulties that complicate verbal communication. It poses significant challenges to Automatic Speech Recognition (ASR) systems because dysarthric speech data are scarce and speakers vary widely. This study investigates speaker-independent (SI) approaches to assist speakers with communication impairments. First, we developed dysarthric SI models using a Conformer-based system and a three-stage transfer-learning pipeline that employs a selective layer-freezing parameter-efficient fine-tuning (PEFT) strategy to mitigate data scarcity. We pre-trained on standard speech and then progressively adapted the models to each of two dysarthric datasets. Second, we introduced a benchmark framework for evaluating the generalizability of SI models with cross-dataset validation, a previously unexplored approach in dysarthric ASR that provides a more realistic scenario. The results demonstrate that the proposed dysarthric SI models outperform all baseline systems. Specifically, on the TORGO dataset, our models improved word recognition accuracy by 21.9% for isolated speech and reduced the word error rate by 18.5% for continuous speech. On UA-Speech, our optimal dysarthric SI model achieved a word recognition improvement of 14.6% over Whisper and 28.3% over the base model for isolated speech. Nevertheless, our cross-dataset testing showed that, for severe dysarthria, models tended to produce isolated words when asked to transcribe continuous speech, highlighting the need to further improve SI generalization.
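The abstract reports two metrics: word recognition accuracy for isolated speech and word error rate (WER) for continuous speech. The sketch below shows how such metrics are commonly computed. It is not the authors' evaluation code. It assumes the standard WER definition (word-level edit distance divided by reference length), lowercase normalization, and whitespace tokenization, none of which the abstract specifies. The function names are hypothetical.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Word-level Levenshtein distance (substitutions, deletions, insertions)."""
    prev = list(range(len(hyp) + 1))  # distance from empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1]


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER for one utterance; normalization here is an assumption."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hyp) / len(ref)


def word_recognition_accuracy(
    references: Sequence[str], hypotheses: Sequence[str]
) -> float:
    """Fraction of isolated-word utterances transcribed exactly."""
    if len(references) != len(hypotheses) or not references:
        raise ValueError("need two non-empty lists of equal length")
    correct = sum(
        r.strip().lower() == h.strip().lower()
        for r, h in zip(references, hypotheses)
    )
    return correct / len(references)


if __name__ == "__main__":
    # A continuous utterance collapsed to a single word: three deletions.
    print(word_error_rate("please open the door", "door"))  # 0.75
    print(word_recognition_accuracy(["yes", "no", "stop"], ["yes", "go", "stop"]))
```

The first call illustrates the failure mode noted in the cross-dataset results: when a model emits one isolated word for a four-word utterance, the three missing words count as deletions and the WER is 0.75.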
About the journal:
Rehabilitative and neural aspects of biomedical engineering, including functional electrical stimulation, acoustic dynamics, human performance measurement and analysis, nerve stimulation, electromyography, motor control and stimulation; and hardware and software applications for rehabilitation engineering and assistive devices.