{"title":"Fusing Time-Frequency Heterogeneous Features With Cross-Attention Mechanism for Pathological Voice Detection.","authors":"Zhang Jiaqing, Wu Yaqin, Zhang Tao","doi":"10.1016/j.jvoice.2025.09.017","DOIUrl":null,"url":null,"abstract":"<p><p>To address the critical challenges of data scarcity, feature homogenization, and limited model generalization in current pathological voice diagnosis systems, a novel algorithm was developed to integrate time-frequency heterogeneous acoustic features for multi-class pathological voice detection. The Wav2vec2-XLSR model pretrained through self-supervised learning was first employed to extract deep contextual features from time-domain voice signals. Mel-Frequency Cepstral Coefficients (MFCC) features from the frequency domain were subsequently integrated to construct a heterogeneous vocal feature space. A cross-attention mechanism from the Transformer architecture was innovatively applied to achieve dynamic spatiotemporal alignment and semantic interaction within the heterogeneous feature space, enabling complementary feature enhancement. A dual-granularity joint analysis framework encompassing vowel and sentence hierarchies was ultimately established for efficient multi-type pathological voice detection. Experimental results demonstrated that the proposed algorithm achieved 95.1% accuracy, 100% recall, 0.92 F1-score, and 0.97 area under the ROC curve (AUC) value on sentence-level Saarbruecken Voice Database (SVD) dataset. For vowel-level classification tasks, classification accuracies of 100% and 99.6% were obtained on the Massachusetts Eye and Ear Infirmary (MEEI) and SVD datasets, respectively. 
Multi-corpus evaluation experiments confirm the algorithm's robustness and generalization capability across different data distributions.</p>","PeriodicalId":49954,"journal":{"name":"Journal of Voice","volume":" ","pages":""},"PeriodicalIF":2.4000,"publicationDate":"2025-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Voice","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1016/j.jvoice.2025.09.017","RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"AUDIOLOGY & SPEECH-LANGUAGE PATHOLOGY","Score":null,"Total":0}
Citations: 0
Abstract
To address the critical challenges of data scarcity, feature homogenization, and limited model generalization in current pathological voice diagnosis systems, a novel algorithm was developed to integrate time-frequency heterogeneous acoustic features for multi-class pathological voice detection. The Wav2vec2-XLSR model, pretrained through self-supervised learning, was first employed to extract deep contextual features from time-domain voice signals. Mel-Frequency Cepstral Coefficient (MFCC) features from the frequency domain were subsequently integrated to construct a heterogeneous vocal feature space. A cross-attention mechanism from the Transformer architecture was innovatively applied to achieve dynamic spatiotemporal alignment and semantic interaction within the heterogeneous feature space, enabling complementary feature enhancement. A dual-granularity joint analysis framework encompassing vowel and sentence hierarchies was ultimately established for efficient multi-type pathological voice detection. Experimental results demonstrated that the proposed algorithm achieved 95.1% accuracy, 100% recall, a 0.92 F1-score, and a 0.97 area under the ROC curve (AUC) on the sentence-level Saarbruecken Voice Database (SVD) dataset. For vowel-level classification tasks, classification accuracies of 100% and 99.6% were obtained on the Massachusetts Eye and Ear Infirmary (MEEI) and SVD datasets, respectively. Multi-corpus evaluation experiments confirmed the algorithm's robustness and generalization capability across different data distributions.
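To make the fusion step concrete: the abstract describes a Transformer-style cross-attention mechanism in which one feature stream (e.g., time-domain Wav2vec2-XLSR embeddings) attends to the other (frequency-domain MFCCs), aligning the two sequences even when their frame counts differ. The paper's exact architecture is not given here, so the following is a minimal NumPy sketch of single-head cross-attention under assumed shapes; the feature dimensions, frame counts, and the absence of learned query/key/value projections are all simplifying assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values):
    """Single-head cross-attention (no learned projections, for illustration).

    queries:     (T_q, d)  - e.g. time-domain contextual features
    keys_values: (T_kv, d) - e.g. frequency-domain MFCC features
    Returns:     (T_q, d)  - queries enriched with attended frequency info
    """
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)   # (T_q, T_kv) similarity
    weights = softmax(scores, axis=-1)              # each row sums to 1
    return weights @ keys_values                    # weighted mix of key/value frames

rng = np.random.default_rng(0)
# Hypothetical stand-ins: 50 Wav2vec2-style frames and 80 MFCC frames,
# both projected to a common dimension of 64 (projection step omitted).
wav2vec_feats = rng.standard_normal((50, 64))
mfcc_feats = rng.standard_normal((80, 64))

fused = cross_attention(wav2vec_feats, mfcc_feats)
print(fused.shape)  # (50, 64): one fused vector per time-domain frame
```

Note that the output length follows the query stream, which is what lets cross-attention align sequences of different lengths; a full model would add learned Q/K/V projections, multiple heads, and residual connections.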
Journal overview:
The Journal of Voice is widely regarded as the world's premier journal for voice medicine and research. This peer-reviewed publication is listed in Index Medicus and is indexed by the Institute for Scientific Information. The journal contains articles written by experts throughout the world on all topics in voice sciences, voice medicine and surgery, and speech-language pathologists' management of voice-related problems. The journal includes clinical articles, clinical research, and laboratory research. Members of the Foundation receive the journal as a benefit of membership.