Yaqi Wang, Tingting Qu, Wenbo Zhu, Qi Wang, Yuping Cao, Renzhou Gui
Title: A hybrid model using multimodal feature perception and multiple cross-attention fusion for depressive episodes detection
DOI: 10.1016/j.inffus.2025.103354
Journal: Information Fusion, Volume 124, Article 103354 (Q1, Computer Science, Artificial Intelligence; Impact Factor 15.5)
Published: 2025-06-04 (Journal Article)
URL: https://www.sciencedirect.com/science/article/pii/S1566253525004270
Citations: 0
Abstract
Depressive episodes are among the most prevalent manifestations of mood disorders worldwide. Currently, the diagnosis of depressive episodes relies primarily on professional clinical assessment. However, with the rising prevalence of depressive episodes, together with the growing diversity of subtypes, atypical presentations, and increasingly insidious symptoms, timely and accurate detection has become more difficult. To address this issue, a hybrid model based on multimodal feature perception and multiple cross-attention fusion (MFCAF) is proposed for the automated detection of depressive episodes. MFCAF integrates video, audio, and functional near-infrared spectroscopy (fNIRS) data collected under identical stimulus conditions. It consists of two primary phases: feature perception and feature fusion. In the feature perception stage, a multi-scale convolutional neural network (CNN) combined with a gated recurrent unit (GRU) is utilized to extract video features. Meanwhile, deep audio features are extracted by applying a Vision Transformer (ViT) to the heatmap generated from the correlation matrix of the Mel spectrogram. Additionally, a multi-channel CNN is used to extract fNIRS features. In the feature fusion stage, a Transformer-based multiple cross-attention fusion module is constructed to capture complex cross-modal dependencies. The experimental results show that, on a dataset collected from 122 participants, MFCAF detects depressive episodes quickly and accurately, outperforming the baseline methods; the model achieved an accuracy of 78.38% under the negative stimulus task. These results suggest that the proposed model holds promise as a rapid auxiliary detection tool for depressive episodes in large-scale populations.
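To make the fusion stage concrete, the sketch below illustrates the general idea of cross-attention between per-modality feature sequences, in which one modality's features act as queries attending over another modality's features. This is a minimal, illustrative NumPy sketch under assumed shapes and a single attention head; the paper's actual module is Transformer-based and multi-headed, and all variable names and dimensions here are hypothetical, not taken from the source.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query_feats, kv_feats):
    """One single-head cross-attention step: the query modality
    attends over the key/value modality's time steps."""
    d = query_feats.shape[-1]
    scores = query_feats @ kv_feats.T / np.sqrt(d)   # (Tq, Tkv) similarities
    weights = softmax(scores, axis=-1)               # rows sum to 1
    return weights @ kv_feats                        # (Tq, d) attended features

# Hypothetical per-modality sequences (time steps x shared feature dim)
rng = np.random.default_rng(0)
video = rng.standard_normal((8, 64))   # e.g. CNN+GRU video features
audio = rng.standard_normal((12, 64))  # e.g. ViT audio features
fnirs = rng.standard_normal((10, 64))  # e.g. multi-channel CNN fNIRS features

# One possible fusion: video queries attend to audio and to fNIRS,
# and the two attended outputs are concatenated per time step
fused = np.concatenate([cross_attention(video, audio),
                        cross_attention(video, fnirs)], axis=-1)
print(fused.shape)  # (8, 128)
```

In a full multiple cross-attention design, every modality pair would typically get such a step (with learned query/key/value projections), and the outputs would be combined before classification; the sketch omits those projections for brevity.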
About the journal:
Information Fusion serves as a central platform for showcasing advancements in multi-sensor, multi-source, multi-process information fusion, fostering collaboration among the diverse disciplines driving its progress. It is the leading outlet for sharing research and development in this field, focusing on architectures, algorithms, and applications. Papers presenting fundamental theoretical analyses, as well as those demonstrating their application to real-world problems, are welcome.