{"title":"Weakly supervised temporal action localization via a multimodal feature map diffusion process","authors":"Yuanbing Zou , Qingjie Zhao , Shanshan Li","doi":"10.1016/j.engappai.2025.111044","DOIUrl":null,"url":null,"abstract":"<div><div>With the continuous growth of massive video data, understanding video content has become increasingly important. Weakly supervised temporal action localization (WTAL), as a critical task, has received significant attention. The goal of WTAL is to learn temporal class activation maps (TCAMs) using only video-level annotations and perform temporal action localization via post-processing steps. However, due to the lack of detailed behavioral information in video-level annotations, the separability between foreground and background in the learned TCAM is poor, leading to incomplete action predictions. To this end, we leverage the inherent advantages of the Contrastive Language-Image Pre-training (CLIP) model in generating high-semantic visual features. By integrating CLIP-based visual information, we further enhance the representational capability of action features. We propose a novel multimodal feature map generation method based on diffusion models to fully exploit the complementary relationships between modalities. Specifically, we design a hard masking strategy to generate hard masks, which are then used as frame-level pseudo-ground truth inputs for the diffusion model. These masks are used to convey human behavior knowledge, enhancing the model’s generative capacity. Subsequently, the concatenated multimodal feature maps are employed as conditional inputs to guide the generation of diffusion feature maps. This design enables the model to extract rich action cues from diverse modalities. Experimental results demonstrate that our approach achieves state-of-the-art performance on two popular benchmarks. These results highlight the proposed method’s capability to achieve precise and efficient temporal action detection under weak supervision, making a significant contribution to the advancement in large-scale video data analysis.</div></div>","PeriodicalId":50523,"journal":{"name":"Engineering Applications of Artificial Intelligence","volume":"156 ","pages":"Article 111044"},"PeriodicalIF":7.5000,"publicationDate":"2025-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Engineering Applications of Artificial Intelligence","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0952197625010449","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"AUTOMATION & CONTROL SYSTEMS","Score":null,"Total":0}
Abstract
With the continuous growth of massive video data, understanding video content has become increasingly important, and weakly supervised temporal action localization (WTAL) has received significant attention as a critical task. The goal of WTAL is to learn temporal class activation maps (TCAMs) using only video-level annotations and to perform temporal action localization via post-processing steps. However, because video-level annotations lack detailed behavioral information, the separability between foreground and background in the learned TCAM is poor, leading to incomplete action predictions. To address this, we leverage the inherent strength of the Contrastive Language-Image Pre-training (CLIP) model in producing semantically rich visual features; integrating CLIP-based visual information further enhances the representational capability of action features. We propose a novel multimodal feature map generation method based on diffusion models to fully exploit the complementary relationships between modalities. Specifically, we design a hard masking strategy to generate hard masks, which serve as frame-level pseudo-ground-truth inputs for the diffusion model; these masks convey human-behavior knowledge and enhance the model’s generative capacity. The concatenated multimodal feature maps are then employed as conditional inputs to guide the generation of diffusion feature maps, enabling the model to extract rich action cues from diverse modalities. Experimental results demonstrate that our approach achieves state-of-the-art performance on two popular benchmarks. These results highlight the proposed method’s ability to perform precise and efficient temporal action detection under weak supervision, contributing to the advancement of large-scale video data analysis.
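To make the two central ideas of the abstract concrete, the following is a minimal sketch, not the authors' implementation: it shows (1) thresholding a TCAM into a "hard mask" used as a frame-level pseudo-ground-truth signal, and (2) concatenating backbone features with CLIP visual features as the conditioning input of a single denoising step. The tensor sizes, the threshold value, the noising factor, and the tiny denoiser network are all illustrative assumptions.

```python
# Hypothetical sketch of hard-mask generation and multimodal conditioning.
# Shapes, threshold, and the toy denoiser are assumptions, not the paper's design.
import torch
import torch.nn as nn

def hard_mask_from_tcam(tcam: torch.Tensor, threshold: float = 0.5) -> torch.Tensor:
    """Binarize a (T, C) TCAM into a (T,) frame-level foreground mask."""
    fg_score, _ = tcam.max(dim=-1)          # strongest class response per frame
    return (fg_score > threshold).float()   # 1 = action frame, 0 = background

class ConditionalDenoiser(nn.Module):
    """Toy denoiser: predicts the noise on the mask given multimodal conditions."""
    def __init__(self, cond_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(cond_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, noisy_mask: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # noisy_mask: (T, 1); cond: (T, cond_dim) concatenated multimodal features
        return self.net(torch.cat([noisy_mask, cond], dim=-1))

if __name__ == "__main__":
    T, D_rgb, D_clip = 64, 1024, 512            # assumed snippet and feature sizes
    tcam = torch.rand(T, 20)                     # stand-in TCAM over 20 classes
    mask = hard_mask_from_tcam(tcam)             # frame-level pseudo labels

    rgb_flow = torch.randn(T, D_rgb)             # backbone (e.g., I3D) features
    clip_vis = torch.randn(T, D_clip)            # CLIP visual features
    cond = torch.cat([rgb_flow, clip_vis], -1)   # multimodal conditioning input

    # One simplified diffusion step: corrupt the mask, predict the added noise.
    noise = torch.randn(T, 1)
    noisy = mask.unsqueeze(-1) + 0.3 * noise     # simplified noising schedule
    model = ConditionalDenoiser(cond_dim=cond.shape[-1])
    pred_noise = model(noisy, cond)
    print(pred_noise.shape)                      # torch.Size([64, 1])
```

In the paper's framework the denoiser and noise schedule would follow a full diffusion formulation; the sketch only illustrates how a thresholded TCAM can supply frame-level pseudo labels and how concatenated multimodal features can act as the condition.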
Journal description:
Artificial Intelligence (AI) is pivotal in driving the fourth industrial revolution, witnessing remarkable advancements across various machine learning methodologies. AI techniques have become indispensable tools for practicing engineers, enabling them to tackle previously insurmountable challenges. Engineering Applications of Artificial Intelligence serves as a global platform for the swift dissemination of research elucidating the practical application of AI methods across all engineering disciplines. Submitted papers are expected to present novel aspects of AI utilized in real-world engineering applications, validated using publicly available datasets to ensure the replicability of research outcomes. Join us in exploring the transformative potential of AI in engineering.