Yaqi Wang, Tingting Qu, Wenbo Zhu, Qi Wang, Yuping Cao, Renzhou Gui
Title: A hybrid model using multimodal feature perception and multiple cross-attention fusion for depressive episodes detection
DOI: 10.1016/j.inffus.2025.103354
Journal: Information Fusion, Volume 124, Article 103354 (Q1, Computer Science, Artificial Intelligence; Impact Factor 15.5)
Published: 2025-06-04 (Journal Article)
URL: https://www.sciencedirect.com/science/article/pii/S1566253525004270
Citations: 0
Abstract
Depressive episodes are among the most prevalent manifestations of mood disorders worldwide. Currently, the diagnosis of depressive episodes relies primarily on professional clinical assessment. However, with the rising prevalence of depressive episodes, together with the growing diversity of subtypes, atypical presentations, and increasingly insidious symptoms, timely and accurate detection has become more difficult. To address this issue, a hybrid model based on multimodal feature perception and multiple cross-attention fusion (MFCAF) is proposed for the automated detection of depressive episodes. MFCAF integrates video, audio, and functional near-infrared spectroscopy (fNIRS) data collected under identical stimulus conditions. It consists of two primary phases: feature perception and feature fusion. In the feature perception stage, a multi-scale convolutional neural network (CNN) combined with a gated recurrent unit (GRU) is utilized to extract video features. Meanwhile, deep audio features are extracted by applying a Vision Transformer (ViT) to the heatmap generated from the correlation matrix of the Mel spectrogram. Additionally, a multi-channel CNN is used to extract fNIRS features. In the feature fusion stage, a Transformer-based multiple cross-attention fusion module is constructed to capture complex cross-modal dependencies. The experimental results show that, on a dataset collected from 122 participants, MFCAF detects depressive episodes quickly and accurately, outperforming the baseline methods; the model achieved an accuracy of 78.38% under the negative stimulus task. These results suggest that the proposed model holds promise as a rapid auxiliary detection tool for depressive episodes in large-scale populations.
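To make the fusion stage concrete, the sketch below illustrates the general idea of cross-attention between per-modality feature sequences, in which one modality's features act as queries attending over another modality's features. This is a minimal, illustrative NumPy sketch under assumed shapes and a single attention head; the paper's actual module is Transformer-based and multi-headed, and all variable names and dimensions here are hypothetical, not taken from the source.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query_feats, kv_feats):
    """One single-head cross-attention step: the query modality
    attends over the key/value modality's time steps."""
    d = query_feats.shape[-1]
    scores = query_feats @ kv_feats.T / np.sqrt(d)   # (Tq, Tkv) similarities
    weights = softmax(scores, axis=-1)               # rows sum to 1
    return weights @ kv_feats                        # (Tq, d) attended features

# Hypothetical per-modality sequences (time steps x shared feature dim)
rng = np.random.default_rng(0)
video = rng.standard_normal((8, 64))   # e.g. CNN+GRU video features
audio = rng.standard_normal((12, 64))  # e.g. ViT audio features
fnirs = rng.standard_normal((10, 64))  # e.g. multi-channel CNN fNIRS features

# One possible fusion: video queries attend to audio and to fNIRS,
# and the two attended outputs are concatenated per time step
fused = np.concatenate([cross_attention(video, audio),
                        cross_attention(video, fnirs)], axis=-1)
print(fused.shape)  # (8, 128)
```

In a full multiple cross-attention design, every modality pair would typically get such a step (with learned query/key/value projections), and the outputs would be combined before classification; the sketch omits those projections for brevity.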
About the journal:
Information Fusion serves as a central platform for showcasing advancements in multi-sensor, multi-source, multi-process information fusion, fostering collaboration among the diverse disciplines driving its progress. It is the leading outlet for sharing research and development in this field, focusing on architectures, algorithms, and applications. Papers presenting fundamental theoretical analyses, as well as those demonstrating their application to real-world problems, are welcome.