Video Complicated-Information Extraction and Filtering Network for Weakly-Supervised Temporal Action Localization
Authors: Jiaxuan Li; Tiancheng Ma; Xiaohui Yang; Lijun Yang; Chen Zheng
Journal: IEEE Signal Processing Letters, vol. 32, pp. 2334-2338 (Impact Factor 3.2; JCR Q2, Engineering, Electrical & Electronic)
Publication date: 2025-06-02
DOI: 10.1109/LSP.2025.3575626
URL: https://ieeexplore.ieee.org/document/11020805/
Citations: 0
Abstract
Weakly-supervised temporal action localization aims to identify action instances and localize them in untrimmed videos using only video-level labels. Because video data are temporally continuous, most methods that use a single-scale convolution kernel cannot model the characteristics of video data effectively, which reduces accuracy. However, simply using multi-scale features can introduce redundant information and noise, reducing model efficiency and impairing the model's judgment during training. To alleviate this problem, a video complicated-information extraction and filtering network (VCEF-Net) is proposed. It contains two main modules. The first, a multi-scale feature extraction module, enriches the information the model receives. The second, a pseudo-label filtering module, suppresses interference from redundant information. Together, these two modules make better use of video information. Experiments on THUMOS14 and ActivityNet1.2 demonstrate the improved performance of the proposed VCEF-Net and validate its effectiveness.
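The abstract does not specify how either module is built, so the following is only an illustrative sketch of the two general ideas it names, not the VCEF-Net implementation. It assumes per-snippet action scores as input, uses simple box filters of several odd widths as a stand-in for multi-scale temporal convolution, and keeps only confident snippets as pseudo-labels. All function names, kernel sizes, and thresholds here are hypothetical.

```python
from typing import List, Optional, Sequence


def temporal_conv(scores: Sequence[float], kernel: Sequence[float]) -> List[float]:
    """1-D convolution along time with zero padding; output length equals input length.

    Assumes an odd kernel length so the window is centred on each snippet.
    """
    k = len(kernel)
    pad = k // 2
    padded = [0.0] * pad + list(scores) + [0.0] * pad
    return [
        sum(padded[t + i] * kernel[i] for i in range(k))
        for t in range(len(scores))
    ]


def multi_scale_features(
    scores: Sequence[float], kernel_sizes: Sequence[int] = (1, 3, 5)
) -> List[float]:
    """Average the responses of box filters at several temporal scales.

    Hypothetical fusion rule: a plain mean over the branches.
    """
    branches = [temporal_conv(scores, [1.0 / k] * k) for k in kernel_sizes]
    return [sum(vals) / len(branches) for vals in zip(*branches)]


def filter_pseudo_labels(
    fused: Sequence[float], low: float = 0.3, high: float = 0.7
) -> List[Optional[int]]:
    """Keep confident snippets only: 1 = action, 0 = background, None = ignored."""
    return [1 if s >= high else 0 if s <= low else None for s in fused]


if __name__ == "__main__":
    # Toy per-snippet scores for one video (made-up values).
    snippet_scores = [0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0]
    fused = multi_scale_features(snippet_scores)
    print(filter_pseudo_labels(fused))
```

In this sketch, snippets whose fused score falls between the two thresholds are left unlabeled, which is one simple way to keep ambiguous or noisy multi-scale responses from influencing training.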
About the journal:
The IEEE Signal Processing Letters is a monthly, archival publication designed to provide rapid dissemination of original, cutting-edge ideas and timely, significant contributions in signal, image, speech, language and audio processing. Papers published in the Letters can be presented, within one year of their appearance, at signal processing conferences such as ICASSP, GlobalSIP and ICIP, and also at several workshops organized by the Signal Processing Society.