Weakly Supervised Micro- and Macro-Expression Spotting Based on Multi-Level Consistency

IEEE Transactions on Pattern Analysis and Machine Intelligence. Pub Date: 2025-04-28. DOI: 10.1109/TPAMI.2025.3564951

Wang-Wang Yu;Kai-Fu Yang;Hong-Mei Yan;Yong-Jie Li

Volume 47, Issue 8, pp. 6912-6928. Full text: https://ieeexplore.ieee.org/document/10979496/


Abstract

Most micro- and macro-expression spotting methods for untrimmed videos suffer from the burden of video-wise collection and frame-wise annotation. Weakly supervised expression spotting (WES) based on video-level labels can potentially mitigate the complexity of frame-level annotation while still achieving fine-grained frame-level spotting. However, we argue that existing weakly supervised methods are based on multiple instance learning (MIL), which involves inter-modality, inter-sample, and inter-task gaps. The inter-sample gap arises primarily from differences in sample distribution and duration. Therefore, we propose a novel and simple WES framework, MC-WES, which uses multi-consistency collaborative mechanisms to alleviate these gaps and incorporate prior knowledge, achieving fine frame-level spotting with only video-level labels. The mechanisms comprise four strategies: modal-level saliency, video-level distribution, label-level duration, and segment-level feature consistency. The modal-level saliency consistency strategy focuses on capturing key correlations between raw images and optical flow. The video-level distribution consistency strategy utilizes the difference of sparsity in temporal distribution. The label-level duration consistency strategy exploits the difference in the duration of facial muscle movements. The segment-level feature consistency strategy emphasizes that features under the same labels maintain similarity. Experimental results on three challenging datasets, CAS(ME)$^{2}$, CAS(ME)$^{3}$, and SAMM-LV, demonstrate that MC-WES is comparable to state-of-the-art fully supervised methods.
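For readers unfamiliar with the MIL setting that the abstract attributes to existing weakly supervised methods, the following is a minimal generic sketch of that baseline idea. It is not MC-WES and implements none of the four consistency strategies. The function names, the top-k pooling choice, and the threshold are illustrative assumptions, not details taken from the paper.

```python
import math

def topk_mean(scores, k):
    """Average of the k largest per-frame scores (one common MIL pooling choice)."""
    if not scores:
        raise ValueError("scores must be non-empty")
    k = max(1, min(k, len(scores)))  # clamp k to a valid range
    return sum(sorted(scores, reverse=True)[:k]) / k

def video_level_loss(frame_logits, video_label, k):
    """Binary cross-entropy between the pooled video score and a 0/1 video-level label.

    Only the video-level label is used; no frame-level annotation is needed.
    """
    pooled = topk_mean(frame_logits, k)
    prob = 1.0 / (1.0 + math.exp(-pooled))  # sigmoid; assumes moderate logit magnitudes
    eps = 1e-12  # guards against log(0)
    return -(video_label * math.log(prob + eps)
             + (1 - video_label) * math.log(1 - prob + eps))

def spot_intervals(frame_probs, threshold):
    """Group consecutive frames at or above threshold into (start, end) pairs, end inclusive."""
    intervals, start = [], None
    for i, p in enumerate(frame_probs):
        if p >= threshold and start is None:
            start = i
        elif p < threshold and start is not None:
            intervals.append((start, i - 1))
            start = None
    if start is not None:  # close an interval that runs to the last frame
        intervals.append((start, len(frame_probs) - 1))
    return intervals
```

In this sketch, training would use only `video_level_loss`, while `spot_intervals` would be applied at inference time to the per-frame probabilities to produce frame-level spotting results.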

