Xinke Wang;Jingyuan Xu;Xiao Sun;Mingzheng Li;Bin Hu;Wei Qian;Dan Guo;Meng Wang
{"title":"Facial Depression Estimation via Multi-Cue Contrastive Learning","authors":"Xinke Wang;Jingyuan Xu;Xiao Sun;Mingzheng Li;Bin Hu;Wei Qian;Dan Guo;Meng Wang","doi":"10.1109/TCSVT.2025.3533543","DOIUrl":null,"url":null,"abstract":"Vision-based depression estimation is an emerging yet impactful task, whose challenge lies in predicting the severity of depression from facial videos lasting at least several minutes. Existing methods primarily focus on fusing frame-level features to create comprehensive representations. However, they often overlook two crucial aspects: 1) inter- and intra-cue correlations, and 2) variations among samples. Hence, simply characterizing sample embeddings while ignoring to mine the relation among multiple cues leads to limitations. To address this problem, we propose a novel Multi-Cue Contrastive Learning (MCCL) framework to mine the relation among multiple cues for discriminative representation. Specifically, we first introduce a novel cross-characteristic attentive interaction module to model the relationship among multiple cues from four facial features (e.g., 3D landmarks, head poses, gazes, FAUs). Then, we propose a temporal segment attentive interaction module to capture the temporal relationships within each facial feature over time intervals. Moreover, we integrate contrastive learning to leverage the variations among samples by regarding the embeddings of inter-cue and intra-cue as positive pairs while considering embeddings from other samples as negative. In this way, the proposed MCCL framework leverages the relationships among the facial features and the variations among samples to enhance the process of multi-cue mining, thereby achieving more accurate facial depression estimation. Extensive experiments on public datasets, DAIC-WOZ, CMDC, and E-DAIC, demonstrate that our model not only outperforms the advanced depression methods but that the discriminative representations of facial behaviors provide potential insights about depression. Our code is available at: <uri>https://github.com/xkwangcn/MCCL.git</uri>","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 6","pages":"6007-6020"},"PeriodicalIF":11.1000,"publicationDate":"2025-01-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Circuits and Systems for Video Technology","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/10852375/","RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
引用次数: 0
Abstract
Vision-based depression estimation is an emerging yet impactful task, whose challenge lies in predicting the severity of depression from facial videos lasting at least several minutes. Existing methods primarily focus on fusing frame-level features to create comprehensive representations. However, they often overlook two crucial aspects: 1) inter- and intra-cue correlations, and 2) variations among samples. Hence, simply characterizing sample embeddings while ignoring to mine the relation among multiple cues leads to limitations. To address this problem, we propose a novel Multi-Cue Contrastive Learning (MCCL) framework to mine the relation among multiple cues for discriminative representation. Specifically, we first introduce a novel cross-characteristic attentive interaction module to model the relationship among multiple cues from four facial features (e.g., 3D landmarks, head poses, gazes, FAUs). Then, we propose a temporal segment attentive interaction module to capture the temporal relationships within each facial feature over time intervals. Moreover, we integrate contrastive learning to leverage the variations among samples by regarding the embeddings of inter-cue and intra-cue as positive pairs while considering embeddings from other samples as negative. In this way, the proposed MCCL framework leverages the relationships among the facial features and the variations among samples to enhance the process of multi-cue mining, thereby achieving more accurate facial depression estimation. Extensive experiments on public datasets, DAIC-WOZ, CMDC, and E-DAIC, demonstrate that our model not only outperforms the advanced depression methods but that the discriminative representations of facial behaviors provide potential insights about depression. Our code is available at: https://github.com/xkwangcn/MCCL.git
期刊介绍:
The IEEE Transactions on Circuits and Systems for Video Technology (TCSVT) is dedicated to covering all aspects of video technologies from a circuits and systems perspective. We encourage submissions of general, theoretical, and application-oriented papers related to image and video acquisition, representation, presentation, and display. Additionally, we welcome contributions in areas such as processing, filtering, and transforms; analysis and synthesis; learning and understanding; compression, transmission, communication, and networking; as well as storage, retrieval, indexing, and search. Furthermore, papers focusing on hardware and software design and implementation are highly valued. Join us in advancing the field of video technology through innovative research and insights.