A novel hierarchical attention-guided refinement method with EEG assistance for enhancing target speech in a multi-speaker competing environment

IF 8 | Region 1 (Engineering Technology) | Q1 Computer Science, Artificial Intelligence

Advanced Engineering Informatics | Pub Date: 2025-04-14 | DOI: 10.1016/j.aei.2025.103363

Zehui Feng, Yangge Yang, Chenqi Zhang, Junxuan Li, Ting Han

Volume 65, Article 103363 | Journal Article | https://www.sciencedirect.com/science/article/pii/S1474034625002563

Citations: 0

Abstract

Enhancing target speech in noisy, multi-speaker environments is a critical challenge, particularly in engineering contexts such as construction sites, factories, and transportation systems, where competing multi-source speech is common and efficient speech enhancement is essential to safety and operational effectiveness. Recent research tends to recover auditory attention with the assistance of brain activity. However, existing methods face two bottlenecks: multimodal feature extraction and multimodal fusion. To address these challenges, this paper proposes a hierarchical attention-guided refinement network for EEG-assisted speech enhancement (HierEEG). HierEEG is an end-to-end, explainable time-domain model comprising three core modules: a Multi-Scale Feature Modulation Refinement (MFMR) module, a Hierarchical Attention Fusion (HAF) network, and a Lightweight Speech Decoder. The first module learns feature representations at different granularities and uses a feature modulator to facilitate interaction between short-term and long-term features, yielding multi-scale refined speech embeddings and EEG features. The second module then hierarchically guides the model’s attention toward high-level semantic features and outputs clean-speech mask embeddings. Finally, the lightweight speech decoder reconstructs the clean speech sample. Comprehensive comparison, ablation, subject-dependent, subject-independent, transfer-learning, engineering, and computational-cost experiments show that HierEEG outperforms state-of-the-art methods on mainstream cocktail-party datasets, with relative improvements of 0.21 dB in SI-SDR and 0.15 in PESQ. HierEEG also proves robust in the simulated engineering experiment, maintaining performance above 10 dB even with various noises, artifacts, and poor contact. Furthermore, HierEEG transfers well to personalized, user-specific adjustment with only 12 min of fine-tuning samples. Its efficient processing and low computational cost, with under 70 % inference utilization on the Jetson Nano embedded device, enhance its potential for application in multi-speaker competing environments. Finally, the brain-region experiment demonstrates the explainability of HierEEG, showing that its decisions can be understood in the context of the brain’s functional organization.
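The abstract reports gains in SI-SDR (scale-invariant signal-to-distortion ratio) without defining it. As a reading aid, the sketch below computes SI-SDR in its commonly used form: project the estimate onto the reference, then compare the energy of that projection with the energy of the residual. It is not taken from the paper; the function name `si_sdr`, the `eps` stabilizer, and the synthetic 440 Hz test signal are assumptions made here for illustration.

```python
import numpy as np


def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant SDR in dB between two 1-D signals of equal length.

    Hypothetical helper for illustration; not the authors' evaluation code.
    """
    estimate = np.asarray(estimate, dtype=np.float64)
    reference = np.asarray(reference, dtype=np.float64)

    # Remove the mean so a constant offset does not affect the score.
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()

    # Optimal scaling of the reference: this makes the metric scale-invariant.
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference   # part of the estimate explained by the reference
    residual = estimate - target  # everything else counts as distortion

    ratio = (np.dot(target, target) + eps) / (np.dot(residual, residual) + eps)
    return 10.0 * np.log10(ratio)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    t = np.arange(16000) / 16000.0          # one second at an assumed 16 kHz rate
    clean = np.sin(2 * np.pi * 440.0 * t)   # synthetic stand-in for target speech
    noisy = clean + 0.1 * rng.standard_normal(clean.shape)
    print(f"SI-SDR of noisy estimate: {si_sdr(noisy, clean):.2f} dB")
    print(f"Same estimate at half gain: {si_sdr(0.5 * noisy, clean):.2f} dB")
```

Because the reference is rescaled before the comparison, changing the gain of the estimate leaves the score unchanged, which is why the metric suits separation models whose output level is arbitrary. PESQ, the other metric named in the abstract, is a standardized perceptual measure and is not reproduced here.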




Source journal

Advanced Engineering Informatics (Engineering Technology: Engineering, Multidisciplinary)

CiteScore: 12.40
Self-citation rate: 18.20%
Articles published: 292
Review time: 45 days

Journal description: Advanced Engineering Informatics is an international journal that solicits research papers with an emphasis on 'knowledge' and 'engineering applications'. The journal seeks original papers that report progress in applying methods of engineering informatics. These papers should have engineering relevance and help provide a scientific base for more reliable, spontaneous, and creative engineering decision-making. Additionally, papers should demonstrate the science of supporting knowledge-intensive engineering tasks and validate the generality, power, and scalability of new methods through rigorous evaluation, preferably both qualitatively and quantitatively. Abstracting and indexing for Advanced Engineering Informatics include Science Citation Index Expanded, Scopus, and INSPEC.