Attention-based multi-level feature fusion for voice disorder diagnosis

IF 5.6, Zone 2 (Engineering & Technology), Q1 ENGINEERING, MULTIDISCIPLINARY

Measurement, Volume 258, Article 119168. Pub Date: 2025-09-30. DOI: 10.1016/j.measurement.2025.119168

Lipeng Shen, Yifan Xiong, Dongyue Guo, Wei Mo, Lingyu Yu, Hui Yang, Yi Lin


Citations: 0

Abstract

Voice disorders negatively impact the quality of daily life in many ways. However, accurately recognizing the category of pathological features from raw audio remains a considerable challenge because the available datasets are limited. A promising way to address this issue is to extract multi-level pathological information from speech comprehensively by fusing features in the latent space. In this paper, a novel framework is designed to explore high-quality feature fusion for effective and generalizable detection performance. Specifically, the proposed model follows a two-stage training paradigm: (1) ECAPA-TDNN and Wav2vec 2.0, which have shown remarkable effectiveness in various domains, are employed to learn universal pathological information from raw audio; (2) an attentive fusion module is designed to establish the interaction between the pathological features projected by ECAPA-TDNN and Wav2vec 2.0 and to guide the multi-layer fusion, and the entire model is then jointly fine-tuned from the pre-trained features on the automatic voice pathology detection task. Comprehensive experiments demonstrate that the proposed framework outperforms competitive baselines, achieving accuracies of 90.51% and 87.68% on the FEMH and SVD datasets, respectively. Furthermore, the proposed framework achieves performance comparable to selected baselines with only 70% of the training dataset.
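The abstract does not specify how the attentive fusion module is built, so the following is only a minimal sketch of one common way to let two feature streams interact: bidirectional scaled dot-product cross-attention followed by pooling and concatenation. Everything here is an assumption made for illustration: the function names (`cross_attend`, `fuse`), the absence of learned projections, the residual addition, the mean pooling, and the random stand-in features. It is not the authors' implementation, and real ECAPA-TDNN or Wav2vec 2.0 embeddings would first need to be projected to a shared dimension.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, keys_values):
    """For each query frame, return an attention-weighted average of keys_values.

    Plain scaled dot-product attention; no learned projections (assumption).
    """
    dim = len(queries[0])
    scale = math.sqrt(dim)
    attended = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys_values]
        weights = softmax(scores)
        context = [sum(w * kv[j] for w, kv in zip(weights, keys_values)) for j in range(dim)]
        attended.append(context)
    return attended


def mean_pool(frames):
    """Average a sequence of frame vectors over time."""
    n = len(frames)
    return [sum(f[j] for f in frames) / n for j in range(len(frames[0]))]


def fuse(branch_a, branch_b):
    """Fuse two feature sequences into one utterance-level vector.

    Each branch attends to the other, the attended context is added back
    residually, and the two pooled results are concatenated (hypothetical design).
    """
    a_ctx = cross_attend(branch_a, branch_b)
    b_ctx = cross_attend(branch_b, branch_a)
    a_out = [[x + c for x, c in zip(f, ctx)] for f, ctx in zip(branch_a, a_ctx)]
    b_out = [[x + c for x, c in zip(f, ctx)] for f, ctx in zip(branch_b, b_ctx)]
    return mean_pool(a_out) + mean_pool(b_out)


# Random stand-ins for the two encoders' frame-level features (not real embeddings).
random.seed(0)
DIM = 8
feats_a = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(5)]
feats_b = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(7)]
fused = fuse(feats_a, feats_b)
print(len(fused))  # 16: two pooled DIM-sized vectors concatenated
```

In this sketch the two streams may have different numbers of frames (5 and 7 here), since attention aligns them softly; a classifier head for the pathology detection task would take the fused vector as input.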


Source journal

Measurement (Engineering & Technology, Engineering: Multidisciplinary)

CiteScore: 10.20

Self-citation rate: 12.50%

Articles published: 1589

Review time: 12.1 months

Journal description: Contributions are invited on novel achievements in all fields of measurement and instrumentation science and technology. Authors are encouraged to submit novel material, whose ultimate goal is an advancement in the state of the art of: measurement and metrology fundamentals, sensors, measurement instruments, measurement and estimation techniques, measurement data processing and fusion algorithms, evaluation procedures and methodologies for plants and industrial processes, performance analysis of systems, processes and algorithms, mathematical models for measurement-oriented purposes, distributed measurement systems in a connected world.