{"title":"Weakly supervised temporal action localization via a multimodal feature map diffusion process","authors":"Yuanbing Zou , Qingjie Zhao , Shanshan Li","doi":"10.1016/j.engappai.2025.111044","DOIUrl":null,"url":null,"abstract":"<div><div>With the continuous growth of massive video data, understanding video content has become increasingly important. Weakly supervised temporal action localization (WTAL), as a critical task, has received significant attention. The goal of WTAL is to learn temporal class activation maps (TCAMs) using only video-level annotations and perform temporal action localization via post-processing steps. However, due to the lack of detailed behavioral information in video-level annotations, the separability between foreground and background in the learned TCAM is poor, leading to incomplete action predictions. To this end, we leverage the inherent advantages of the Contrastive Language-Image Pre-training (CLIP) model in generating high-semantic visual features. By integrating CLIP-based visual information, we further enhance the representational capability of action features. We propose a novel multimodal feature map generation method based on diffusion models to fully exploit the complementary relationships between modalities. Specifically, we design a hard masking strategy to generate hard masks, which are then used as frame-level pseudo-ground truth inputs for the diffusion model. These masks are used to convey human behavior knowledge, enhancing the model’s generative capacity. Subsequently, the concatenated multimodal feature maps are employed as conditional inputs to guide the generation of diffusion feature maps. This design enables the model to extract rich action cues from diverse modalities. Experimental results demonstrate that our approach achieves state-of-the-art performance on two popular benchmarks. These results highlight the proposed method’s capability to achieve precise and efficient temporal action detection under weak supervision, making a significant contribution to the advancement in large-scale video data analysis.</div></div>","PeriodicalId":50523,"journal":{"name":"Engineering Applications of Artificial Intelligence","volume":"156 ","pages":"Article 111044"},"PeriodicalIF":7.5000,"publicationDate":"2025-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Engineering Applications of Artificial Intelligence","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0952197625010449","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"AUTOMATION & CONTROL SYSTEMS","Score":null,"Total":0}
Abstract
With the continuous growth of massive video data, understanding video content has become increasingly important, and weakly supervised temporal action localization (WTAL) has received significant attention as a critical task. The goal of WTAL is to learn temporal class activation maps (TCAMs) using only video-level annotations and to perform temporal action localization via post-processing steps. However, because video-level annotations lack detailed behavioral information, the separability between foreground and background in the learned TCAM is poor, leading to incomplete action predictions. To address this, we leverage the inherent strength of the Contrastive Language-Image Pre-training (CLIP) model in producing semantically rich visual features; integrating CLIP-based visual information further enhances the representational capability of action features. We propose a novel multimodal feature map generation method based on diffusion models to fully exploit the complementary relationships between modalities. Specifically, we design a hard masking strategy to generate hard masks, which serve as frame-level pseudo-ground-truth inputs for the diffusion model; these masks convey human-behavior knowledge and enhance the model’s generative capacity. The concatenated multimodal feature maps are then employed as conditional inputs to guide the generation of diffusion feature maps, enabling the model to extract rich action cues from diverse modalities. Experimental results demonstrate that our approach achieves state-of-the-art performance on two popular benchmarks. These results highlight the proposed method’s ability to perform precise and efficient temporal action detection under weak supervision, contributing to the advancement of large-scale video data analysis.
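To make the two central ideas of the abstract concrete, the following is a minimal sketch, not the authors' implementation: it shows (1) thresholding a TCAM into a "hard mask" used as a frame-level pseudo-ground-truth signal, and (2) concatenating backbone features with CLIP visual features as the conditioning input of a single denoising step. The tensor sizes, the threshold value, the noising factor, and the tiny denoiser network are all illustrative assumptions.

```python
# Hypothetical sketch of hard-mask generation and multimodal conditioning.
# Shapes, threshold, and the toy denoiser are assumptions, not the paper's design.
import torch
import torch.nn as nn

def hard_mask_from_tcam(tcam: torch.Tensor, threshold: float = 0.5) -> torch.Tensor:
    """Binarize a (T, C) TCAM into a (T,) frame-level foreground mask."""
    fg_score, _ = tcam.max(dim=-1)          # strongest class response per frame
    return (fg_score > threshold).float()   # 1 = action frame, 0 = background

class ConditionalDenoiser(nn.Module):
    """Toy denoiser: predicts the noise on the mask given multimodal conditions."""
    def __init__(self, cond_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(cond_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, noisy_mask: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # noisy_mask: (T, 1); cond: (T, cond_dim) concatenated multimodal features
        return self.net(torch.cat([noisy_mask, cond], dim=-1))

if __name__ == "__main__":
    T, D_rgb, D_clip = 64, 1024, 512            # assumed snippet and feature sizes
    tcam = torch.rand(T, 20)                     # stand-in TCAM over 20 classes
    mask = hard_mask_from_tcam(tcam)             # frame-level pseudo labels

    rgb_flow = torch.randn(T, D_rgb)             # backbone (e.g., I3D) features
    clip_vis = torch.randn(T, D_clip)            # CLIP visual features
    cond = torch.cat([rgb_flow, clip_vis], -1)   # multimodal conditioning input

    # One simplified diffusion step: corrupt the mask, predict the added noise.
    noise = torch.randn(T, 1)
    noisy = mask.unsqueeze(-1) + 0.3 * noise     # simplified noising schedule
    model = ConditionalDenoiser(cond_dim=cond.shape[-1])
    pred_noise = model(noisy, cond)
    print(pred_noise.shape)                      # torch.Size([64, 1])
```

In the paper's framework the denoiser and noise schedule would follow a full diffusion formulation; the sketch only illustrates how a thresholded TCAM can supply frame-level pseudo labels and how concatenated multimodal features can act as the condition.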
Journal description:
Artificial Intelligence (AI) is pivotal in driving the fourth industrial revolution, witnessing remarkable advancements across various machine learning methodologies. AI techniques have become indispensable tools for practicing engineers, enabling them to tackle previously insurmountable challenges. Engineering Applications of Artificial Intelligence serves as a global platform for the swift dissemination of research elucidating the practical application of AI methods across all engineering disciplines. Submitted papers are expected to present novel aspects of AI utilized in real-world engineering applications, validated using publicly available datasets to ensure the replicability of research outcomes. Join us in exploring the transformative potential of AI in engineering.