{"title":"Fusing Time-Frequency Heterogeneous Features With Cross-Attention Mechanism for Pathological Voice Detection.","authors":"Zhang Jiaqing, Wu Yaqin, Zhang Tao","doi":"10.1016/j.jvoice.2025.09.017","DOIUrl":null,"url":null,"abstract":"<p><p>To address the critical challenges of data scarcity, feature homogenization, and limited model generalization in current pathological voice diagnosis systems, a novel algorithm was developed to integrate time-frequency heterogeneous acoustic features for multi-class pathological voice detection. The Wav2vec2-XLSR model pretrained through self-supervised learning was first employed to extract deep contextual features from time-domain voice signals. Mel-Frequency Cepstral Coefficients (MFCC) features from the frequency domain were subsequently integrated to construct a heterogeneous vocal feature space. A cross-attention mechanism from the Transformer architecture was innovatively applied to achieve dynamic spatiotemporal alignment and semantic interaction within the heterogeneous feature space, enabling complementary feature enhancement. A dual-granularity joint analysis framework encompassing vowel and sentence hierarchies was ultimately established for efficient multi-type pathological voice detection. Experimental results demonstrated that the proposed algorithm achieved 95.1% accuracy, 100% recall, 0.92 F1-score, and 0.97 area under the ROC curve (AUC) value on sentence-level Saarbruecken Voice Database (SVD) dataset. For vowel-level classification tasks, classification accuracies of 100% and 99.6% were obtained on the Massachusetts Eye and Ear Infirmary (MEEI) and SVD datasets, respectively. 
Multi-corpus evaluation experiments confirm the algorithm's robustness and generalization capability across different data distributions.</p>","PeriodicalId":49954,"journal":{"name":"Journal of Voice","volume":" ","pages":""},"PeriodicalIF":2.4000,"publicationDate":"2025-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Voice","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1016/j.jvoice.2025.09.017","RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"AUDIOLOGY & SPEECH-LANGUAGE PATHOLOGY","Score":null,"Total":0}
Citations: 0
Abstract
To address the critical challenges of data scarcity, feature homogenization, and limited model generalization in current pathological voice diagnosis systems, a novel algorithm was developed to integrate time-frequency heterogeneous acoustic features for multi-class pathological voice detection. The Wav2vec2-XLSR model, pretrained through self-supervised learning, was first employed to extract deep contextual features from time-domain voice signals. Mel-Frequency Cepstral Coefficient (MFCC) features from the frequency domain were subsequently integrated to construct a heterogeneous vocal feature space. A cross-attention mechanism from the Transformer architecture was innovatively applied to achieve dynamic spatiotemporal alignment and semantic interaction within the heterogeneous feature space, enabling complementary feature enhancement. A dual-granularity joint analysis framework encompassing vowel and sentence hierarchies was ultimately established for efficient multi-type pathological voice detection. Experimental results demonstrated that the proposed algorithm achieved 95.1% accuracy, 100% recall, a 0.92 F1-score, and a 0.97 area under the ROC curve (AUC) on the sentence-level Saarbruecken Voice Database (SVD) dataset. For vowel-level classification tasks, classification accuracies of 100% and 99.6% were obtained on the Massachusetts Eye and Ear Infirmary (MEEI) and SVD datasets, respectively. Multi-corpus evaluation experiments confirmed the algorithm's robustness and generalization capability across different data distributions.
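To make the fusion step concrete: the abstract describes a Transformer-style cross-attention mechanism in which one feature stream (e.g., time-domain Wav2vec2-XLSR embeddings) attends to the other (frequency-domain MFCCs), aligning the two sequences even when their frame counts differ. The paper's exact architecture is not given here, so the following is a minimal NumPy sketch of single-head cross-attention under assumed shapes; the feature dimensions, frame counts, and the absence of learned query/key/value projections are all simplifying assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values):
    """Single-head cross-attention (no learned projections, for illustration).

    queries:     (T_q, d)  - e.g. time-domain contextual features
    keys_values: (T_kv, d) - e.g. frequency-domain MFCC features
    Returns:     (T_q, d)  - queries enriched with attended frequency info
    """
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)   # (T_q, T_kv) similarity
    weights = softmax(scores, axis=-1)              # each row sums to 1
    return weights @ keys_values                    # weighted mix of key/value frames

rng = np.random.default_rng(0)
# Hypothetical stand-ins: 50 Wav2vec2-style frames and 80 MFCC frames,
# both projected to a common dimension of 64 (projection step omitted).
wav2vec_feats = rng.standard_normal((50, 64))
mfcc_feats = rng.standard_normal((80, 64))

fused = cross_attention(wav2vec_feats, mfcc_feats)
print(fused.shape)  # (50, 64): one fused vector per time-domain frame
```

Note that the output length follows the query stream, which is what lets cross-attention align sequences of different lengths; a full model would add learned Q/K/V projections, multiple heads, and residual connections.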
Journal overview:
The Journal of Voice is widely regarded as the world's premier journal for voice medicine and research. This peer-reviewed publication is listed in Index Medicus and is indexed by the Institute for Scientific Information. The journal contains articles written by experts throughout the world on all topics in voice sciences, voice medicine and surgery, and speech-language pathologists' management of voice-related problems. The journal includes clinical articles, clinical research, and laboratory research. Members of the Foundation receive the journal as a benefit of membership.