Facial Depression Estimation via Multi-Cue Contrastive Learning

IF 11.1; CAS Zone 1 (Engineering & Technology); JCR Q1, ENGINEERING, ELECTRICAL & ELECTRONIC

IEEE Transactions on Circuits and Systems for Video Technology; Pub Date: 2025-01-24; DOI: 10.1109/TCSVT.2025.3533543

Xinke Wang; Jingyuan Xu; Xiao Sun; Mingzheng Li; Bin Hu; Wei Qian; Dan Guo; Meng Wang

Volume 35, Issue 6, pp. 6007-6020; Journal Article; IEEE Xplore: https://ieeexplore.ieee.org/document/10852375/

Citations: 0

Abstract

Vision-based depression estimation is an emerging yet impactful task whose challenge lies in predicting the severity of depression from facial videos lasting at least several minutes. Existing methods primarily focus on fusing frame-level features to create comprehensive representations. However, they often overlook two crucial aspects: 1) inter- and intra-cue correlations, and 2) variations among samples. Characterizing sample embeddings without mining the relations among multiple cues therefore limits performance. To address this problem, we propose a novel Multi-Cue Contrastive Learning (MCCL) framework that mines the relations among multiple cues to obtain discriminative representations. Specifically, we first introduce a novel cross-characteristic attentive interaction module to model the relationships among multiple cues from four facial features, namely 3D landmarks, head poses, gazes, and facial action units (FAUs). Then, we propose a temporal segment attentive interaction module to capture the temporal relationships within each facial feature over time intervals. Moreover, we integrate contrastive learning to leverage the variations among samples by treating the inter-cue and intra-cue embeddings of a sample as positive pairs and embeddings from other samples as negatives. In this way, the proposed MCCL framework leverages both the relationships among the facial features and the variations among samples to enhance multi-cue mining, thereby achieving more accurate facial depression estimation. Extensive experiments on the public datasets DAIC-WOZ, CMDC, and E-DAIC demonstrate not only that our model outperforms advanced depression estimation methods but also that the discriminative representations of facial behaviors provide potential insights into depression. Our code is available at: https://github.com/xkwangcn/MCCL.git
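The contrastive objective described in the abstract can be illustrated with a generic InfoNCE-style loss. The following is a minimal sketch only, not the authors' implementation: the function name info_nce, the temperature value, the one-directional form of the loss, and the toy embeddings are all assumptions made for illustration. Row i of each input stands for the inter-cue and intra-cue embedding of the same sample (a positive pair); rows from other samples serve as negatives.

```python
import math


def _normalize(vec):
    """Scale a non-zero vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def info_nce(inter_cue, intra_cue, temperature=0.1):
    """Generic InfoNCE loss (illustrative; not taken from the MCCL code base).

    inter_cue, intra_cue: lists of equal length; element i of each is assumed
    to be an embedding of the same sample. The temperature is a hypothetical
    default, not a value reported in the paper.
    """
    a = [_normalize(v) for v in inter_cue]
    p = [_normalize(v) for v in intra_cue]
    total = 0.0
    for i, anchor in enumerate(a):
        # Similarity of this anchor to every candidate embedding in the batch.
        logits = [sum(x * y for x, y in zip(anchor, cand)) / temperature for cand in p]
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability of picking the true positive (index i).
        total += log_denom - logits[i]
    return total / len(a)


# Toy batch of three samples with three-dimensional embeddings (made-up numbers).
inter = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.1], [0.1, 0.0, 1.0]]
intra = [[0.9, 0.2, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 1.1]]

aligned = info_nce(inter, intra)
# Rotating the second list pairs each anchor with another sample's embedding.
shuffled = info_nce(inter, intra[1:] + intra[:1])
```

In this toy setup the loss is lower when each anchor is paired with the embedding of its own sample than when the pairing is rotated, which is the behavior a contrastive objective of this kind is meant to encourage.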


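Source journal

IEEE Transactions on Circuits and Systems for Video Technology (Engineering & Technology: Engineering, Electrical & Electronic)

CiteScore: 13.80

Self-citation rate: 27.40%

Articles published: 660

Review time: 5 months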

About the journal: The IEEE Transactions on Circuits and Systems for Video Technology (TCSVT) is dedicated to covering all aspects of video technologies from a circuits and systems perspective. We encourage submissions of general, theoretical, and application-oriented papers related to image and video acquisition, representation, presentation, and display. Additionally, we welcome contributions in areas such as processing, filtering, and transforms; analysis and synthesis; learning and understanding; compression, transmission, communication, and networking; as well as storage, retrieval, indexing, and search. Furthermore, papers focusing on hardware and software design and implementation are highly valued. Join us in advancing the field of video technology through innovative research and insights.