Self-Supervised ASR Models and Features for Dysarthric and Elderly Speech Recognition

IF 4.1 | Zone 2, Computer Science | JCR Q1, ACOUSTICS
Shujie Hu;Xurong Xie;Mengzhe Geng;Zengrui Jin;Jiajun Deng;Guinan Li;Yi Wang;Mingyu Cui;Tianzi Wang;Helen Meng;Xunying Liu
DOI: 10.1109/TASLP.2024.3422839
Journal: IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 3561-3575
Published: 2024-07-03
Publication type: Journal Article
Full text: https://ieeexplore.ieee.org/document/10584335/
Citation count: 0

Abstract

Self-supervised learning (SSL) based speech foundation models have been applied to a wide range of ASR tasks. However, their application to dysarthric and elderly speech via data-intensive parameter fine-tuning is confronted by in-domain data scarcity and mismatch. To this end, this paper explores a series of approaches to integrate domain fine-tuned SSL pre-trained models and their features into TDNN and Conformer ASR systems for dysarthric and elderly speech recognition. These include: a) input feature fusion between standard acoustic frontends and domain fine-tuned SSL speech representations; b) frame-level joint decoding between TDNN systems separately trained using standard acoustic features alone and those with additional domain fine-tuned SSL features; and c) multi-pass decoding involving the TDNN/Conformer system outputs to be rescored using domain fine-tuned pre-trained ASR models. In addition, fine-tuned SSL speech features are used in acoustic-to-articulatory (A2A) inversion to construct multi-modal ASR systems. Experiments are conducted on four tasks: the English UASpeech and TORGO dysarthric speech corpora; and the English DementiaBank Pitt and Cantonese JCCOCC MoCA elderly speech datasets. The TDNN systems constructed by integrating domain-adapted HuBERT, wav2vec2-conformer or multi-lingual XLSR models and their features consistently outperform the standalone fine-tuned SSL pre-trained models. These systems produced statistically significant WER or CER reductions of 6.53%, 1.90%, 2.04% and 7.97% absolute (24.10%, 23.84%, 10.14% and 31.39% relative) on the four tasks respectively. Consistent improvements in Alzheimer's Disease detection accuracy are also obtained using the DementiaBank Pitt elderly speech recognition outputs.
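Approaches a) and b) above can be illustrated with a small NumPy sketch. It is not the authors' implementation: the abstract does not say how features are fused or how frame-level scores are combined, so per-frame concatenation and log-linear interpolation are assumed here as common choices. The function names, feature dimensions (40 and 768), output size (10) and the interpolation weight are all hypothetical, and both feature streams are assumed to be already aligned to the same frame rate.

```python
import numpy as np


def fuse_features(acoustic, ssl):
    """Input feature fusion, assumed here to be per-frame concatenation."""
    if acoustic.shape[0] != ssl.shape[0]:
        raise ValueError("frame counts differ; align the two streams first")
    return np.concatenate([acoustic, ssl], axis=-1)


def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def joint_frame_scores(logp_a, logp_b, weight=0.5):
    """Frame-level score combination, assumed to be log-linear interpolation.

    Both systems must share the same output units so that scores line up.
    """
    if logp_a.shape != logp_b.shape:
        raise ValueError("systems must share frames and output units")
    return (1.0 - weight) * logp_a + weight * logp_b


rng = np.random.default_rng(0)
num_frames = 5

# Hypothetical sizes: 40-dim acoustic features, 768-dim SSL features.
acoustic = rng.standard_normal((num_frames, 40))
ssl = rng.standard_normal((num_frames, 768))
fused = fuse_features(acoustic, ssl)
print(fused.shape)  # (5, 808)

# Stand-ins for the per-frame outputs of two separately trained systems.
logp_a = log_softmax(rng.standard_normal((num_frames, 10)))
logp_b = log_softmax(rng.standard_normal((num_frames, 10)))
joint = joint_frame_scores(logp_a, logp_b, weight=0.5)
print(joint.shape)  # (5, 10)
```

In a real system the combined frame scores would be passed to the decoder in place of a single system's scores, and the weight would be tuned on held-out data.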
Source journal
IEEE/ACM Transactions on Audio, Speech, and Language Processing (ACOUSTICS; ENGINEERING, ELECTRICAL & ELECTRONIC)
CiteScore: 11.30
Self-citation rate: 11.10%
Articles published: 217
Journal description: The IEEE/ACM Transactions on Audio, Speech, and Language Processing covers audio, speech and language processing and the sciences that support them. In audio processing: transducers, room acoustics, active sound control, human audition, analysis/synthesis/coding of music, and consumer audio. In speech processing: areas such as speech analysis, synthesis, coding, speech and speaker recognition, speech production and perception, and speech enhancement. In language processing: speech and text analysis, understanding, generation, dialog management, translation, summarization, question answering and document indexing and retrieval, as well as general language modeling.