Zhenglin Zhang, Tengfei Wang, Zian Hu, Li-Zhuang Yang, Hai Li
Title: Multivariate time series approach integrating cross-temporal and cross-channel attention for dysarthria detection from speech
Journal: Neurocomputing, volume 647, Article 130708 (impact factor 5.5; JCR Q1, Computer Science, Artificial Intelligence)
Published: 2025-06-06
DOI: 10.1016/j.neucom.2025.130708
URL: https://www.sciencedirect.com/science/article/pii/S0925231225013803
Citations: 0
Abstract
Speech analysis offers a non-invasive, low-cost approach to dysarthria detection. Studies have shown that the temporal correlations within speech signals and the interactions among the multidimensional feature variables derived from them can facilitate dysarthria detection. However, current studies either rely on pre-designed feature sets, which depend heavily on cumbersome feature engineering, or focus solely on spectral or high-dimensional audio vectors, which capture temporal dependencies but neglect the interactions among the internal multivariate features. We propose an end-to-end method that uses pre-trained audio models as multivariate time series feature extractors, combined with InceptionTime and cross-temporal and cross-channel attention mechanisms, to capture both the temporal dependencies and the interactions among variables within speech for accurate dysarthria detection. Results show that the proposed method achieves a detection accuracy of 92.06% on a local Mandarin dysarthria dataset, at least 2.17 percentage points higher than previous studies, with the highest stability and the lowest time cost. Furthermore, it achieves an accuracy of 87.73% on an external English dataset, demonstrating good cross-linguistic adaptability and generalizability. Additionally, experiments show that among connected speech tasks, structured tasks outperform unstructured ones in leveraging these interactions, leading to more effective dysarthria detection. These findings validate the effectiveness of the proposed end-to-end dysarthria detection method and further advance speech analysis as a promising tool for dysarthria screening.
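The abstract does not say how the two attention mechanisms are computed, so the following is only an illustrative sketch of the general idea, not the authors' implementation. It assumes the speech features are already available as a multivariate time series of shape (T, C), with T time steps and C feature channels. It applies plain scaled dot-product self-attention once across time steps and once across channels. The function names, the absence of learned projections, and the additive fusion are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def self_attention(tokens):
    """Scaled dot-product self-attention over the rows of `tokens`.

    No learned query/key/value projections are used (an assumption made
    to keep the sketch short). Returns the attended tokens and the
    row-normalized attention weights.
    """
    dim = tokens.shape[-1]
    scores = tokens @ tokens.T / np.sqrt(dim)
    weights = softmax(scores, axis=-1)
    return weights @ tokens, weights


def cross_temporal_channel_attention(x):
    """Hypothetical fusion of temporal and channel attention.

    x: array of shape (T, C), T time steps by C feature channels.
    """
    # Cross-temporal: each time step is a token described by C channels.
    temporal_out, temporal_w = self_attention(x)        # (T, C), (T, T)
    # Cross-channel: each channel is a token described by T time steps.
    channel_out, channel_w = self_attention(x.T)        # (C, T), (C, C)
    # Additive fusion is an assumption, not taken from the paper.
    fused = temporal_out + channel_out.T                # (T, C)
    return fused, temporal_w, channel_w


# Random stand-in for features from a pre-trained audio model.
rng = np.random.default_rng(0)
features = rng.standard_normal((50, 8))
fused, temporal_w, channel_w = cross_temporal_channel_attention(features)
print(fused.shape, temporal_w.shape, channel_w.shape)
```

The temporal weights form a T-by-T matrix relating time steps, and the channel weights form a C-by-C matrix relating feature variables. These correspond to the two kinds of dependency the abstract says the method is designed to capture.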
About the journal:
Neurocomputing publishes articles describing recent fundamental contributions in the field of neurocomputing. Neurocomputing theory, practice and applications are the essential topics being covered.