Attention-based multi-level feature fusion for voice disorder diagnosis

IF 5.6 | Zone 2, Engineering & Technology | JCR Q1, ENGINEERING, MULTIDISCIPLINARY
Lipeng Shen, Yifan Xiong, Dongyue Guo, Wei Mo, Lingyu Yu, Hui Yang, Yi Lin
Measurement, Volume 258, Article 119168. Published 2025-09-30.
DOI: 10.1016/j.measurement.2025.119168
https://www.sciencedirect.com/science/article/pii/S0263224125025278
Citations: 0

Abstract

Voice disorders degrade the quality of daily life in many ways. However, accurately recognizing the category of pathological features from raw audio remains a considerable challenge because the available datasets are limited. A promising way to address this issue is to extract multi-level pathological information from speech comprehensively by fusing features in the latent space. This paper proposes a novel framework that explores high-quality feature fusion for effective and generalizable detection. Specifically, the proposed model follows a two-stage training paradigm: (1) ECAPA-TDNN and Wav2vec 2.0, which have shown remarkable effectiveness in various domains, are employed to learn universal pathological information from raw audio; (2) an attentive fusion module is designed to establish the interaction between the pathological features projected by ECAPA-TDNN and Wav2vec 2.0, respectively, and to guide the multi-layer fusion, and the entire model is then jointly fine-tuned from the pre-trained features on the automatic voice pathology detection task. Comprehensive experiments demonstrate that the proposed framework outperforms competitive baselines, achieving accuracies of 90.51% and 87.68% on the FEMH and SVD datasets, respectively. Furthermore, the proposed framework achieves performance comparable to that of selected baselines with only 70% of the training dataset.
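The abstract does not specify how the attentive fusion module is built, so the following is only a minimal sketch of one common way to realize attention-guided multi-layer fusion, not the authors' implementation. It assumes (hypothetically) that an ECAPA-TDNN embedding has already been projected to the same dimension as a set of pooled Wav2vec 2.0 layer features, uses that embedding as a query, and weights the layer features with scaled dot-product attention. The function name `attentive_layer_fusion` and all input values are illustrative.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a sequence of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def attentive_layer_fusion(
    query: Vector, layer_feats: List[Vector]
) -> Tuple[Vector, List[float]]:
    """Fuse layer-level features under the guidance of a query vector.

    query:       assumed stand-in for a projected ECAPA-TDNN embedding.
    layer_feats: assumed stand-ins for pooled Wav2vec 2.0 layer outputs,
                 each with the same dimension as `query`.
    Returns the fused vector (query concatenated with the attention-weighted
    sum of the layer features) and the attention weight of each layer.
    """
    if not layer_feats:
        raise ValueError("at least one layer feature is required")
    dim = len(query)
    if any(len(f) != dim for f in layer_feats):
        raise ValueError("all layer features must match the query dimension")

    # Scaled dot-product attention: one score per layer.
    scale = math.sqrt(dim)
    weights = softmax([dot(query, f) / scale for f in layer_feats])

    # Attention-weighted sum of the layer features.
    context = [
        sum(w * f[i] for w, f in zip(weights, layer_feats)) for i in range(dim)
    ]
    return query + context, weights


# Illustrative values only; real features would come from the two encoders.
demo_query = [0.2, 0.9, -0.1]
demo_layers = [[0.1, 0.8, 0.0], [0.9, -0.2, 0.3], [0.0, 0.5, -0.4]]
fused, layer_weights = attentive_layer_fusion(demo_query, demo_layers)
print("layer weights:", [round(w, 3) for w in layer_weights])
print("fused dimension:", len(fused))
```

In this sketch the layers whose features align most closely with the query receive the largest weights, which is one plausible reading of letting one branch "guide" the fusion of the other. A trainable version would add learned projections for the query, keys, and values, and would be fine-tuned jointly with the classifier.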
Source journal: Measurement
Category: Engineering & Technology - Engineering: Multidisciplinary
CiteScore: 10.20
Self-citation rate: 12.50%
Articles published: 1589
Review time: 12.1 months
About the journal: Contributions are invited on novel achievements in all fields of measurement and instrumentation science and technology. Authors are encouraged to submit novel material, whose ultimate goal is an advancement in the state of the art of: measurement and metrology fundamentals, sensors, measurement instruments, measurement and estimation techniques, measurement data processing and fusion algorithms, evaluation procedures and methodologies for plants and industrial processes, performance analysis of systems, processes and algorithms, mathematical models for measurement-oriented purposes, distributed measurement systems in a connected world.