CFGMamba: Cross frame group Mamba for video-based depression recognition

IF 4.9 · CAS Tier 2 (Medicine) · JCR Q1 (Engineering, Biomedical)
Jingyi Liu, Yuanyuan Shang, Mengyuan Yang, Zhuhong Shao, Hui Ding, Tie Liu
DOI: 10.1016/j.bspc.2025.108113
Journal: Biomedical Signal Processing and Control, Volume 110, Article 108113
Published: 2025-06-12 (Journal Article)
Full text: https://www.sciencedirect.com/science/article/pii/S174680942500624X
Citations: 0

Abstract

Depression recognition is a significant research topic in affective computing, with important value for the clinical diagnosis and screening of depression. Video-based depression recognition methods use Convolutional Neural Networks (CNNs) or Transformers to capture relevant visual features and achieve promising performance. However, the limited receptive field of CNNs, the high computational cost of Transformer long-sequence modeling, and the high dimensionality of video data remain key issues. Considering these factors, this work introduces the State Space Model (SSM) for depression recognition and proposes a Cross Frame Group Mamba (CFGMamba) framework. CFGMamba alleviates the limitations of CNNs through global receptive fields and can effectively model long-range sequences with linear complexity. Technically, CFGMamba models cross-frame grouping of video data, dividing video frames into several distinct groups at time intervals and then performing bidirectional scanning for each group in the spatial–temporal dimension. This cross-frame grouping strategy efficiently captures richer emotional features while minimizing computational overhead. Meanwhile, CFGMamba incorporates a multi-stage downsampling approach, where multiple CFGMamba blocks are stacked at each stage to progressively capture multi-scale spatial–temporal emotional features from shallow to deep layers. Experimental results on the AVEC 2013 and AVEC 2014 datasets indicate that CFGMamba achieves competitive performance, with MAE/RMSE of 6.01/7.59 and 5.96/7.52, respectively. On the EmoReact dataset, the F1-score/AUC is 0.75/0.78.
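The cross-frame grouping described in the abstract can be sketched as follows. This is a hypothetical illustration only, not the authors' code: the function names are invented, and strided sampling is one plausible reading of "dividing video frames into several distinct groups at time intervals".

```python
# Illustrative sketch (assumed interpretation, not the authors' implementation)
# of cross-frame grouping: frame indices are split into groups at fixed time
# intervals, and each group is scanned in both temporal directions, mimicking
# the bidirectional sequence scans used by Mamba-style state space models.

def cross_frame_groups(num_frames: int, num_groups: int) -> list[list[int]]:
    """Split frame indices [0, num_frames) into num_groups groups by strided
    sampling, so each group covers the whole clip at coarser temporal resolution."""
    return [list(range(g, num_frames, num_groups)) for g in range(num_groups)]

def bidirectional_scan(group: list[int]) -> tuple[list[int], list[int]]:
    """Return the forward and backward scan orders for one group of frames."""
    return group, group[::-1]

if __name__ == "__main__":
    groups = cross_frame_groups(num_frames=8, num_groups=2)
    print(groups)  # [[0, 2, 4, 6], [1, 3, 5, 7]]
    fwd, bwd = bidirectional_scan(groups[0])
    print(fwd, bwd)  # [0, 2, 4, 6] [6, 4, 2, 0]
```

Because each group is a strided subsequence of the clip, every group still spans the full recording, which is consistent with the paper's claim of capturing long-range emotional cues at reduced per-sequence length.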
Source journal: Biomedical Signal Processing and Control (Engineering & Technology: Biomedical Engineering)
CiteScore: 9.80
Self-citation rate: 13.70%
Articles per year: 822
Review time: 4 months
Journal description: Biomedical Signal Processing and Control aims to provide a cross-disciplinary international forum for the interchange of information on research in the measurement and analysis of signals and images in clinical medicine and the biological sciences. Emphasis is placed on contributions dealing with the practical, applications-led research on the use of methods and devices in clinical diagnosis, patient monitoring and management. Biomedical Signal Processing and Control reflects the main areas in which these methods are being used and developed at the interface of both engineering and clinical science. The scope of the journal is defined to include relevant review papers, technical notes, short communications and letters. Tutorial papers and special issues will also be published.