Title: GCLNet: Generalized Contrastive Learning for Weakly Supervised Temporal Action Localization
Authors: Jing Wang; Dehui Kong; Baocai Yin
Journal: IEEE Transactions on Big Data, vol. 11, no. 5, pp. 2365-2375
DOI: 10.1109/TBDATA.2025.3528727
Publication date: 2025-01-14
URL: https://ieeexplore.ieee.org/document/10840253/
JCR: Q1 (Computer Science, Information Systems)
Citations: 0
Abstract
Weakly supervised temporal action localization (WTAL) aims to precisely locate action instances in given videos using only video-level classification supervision, a signal that is only partly aligned with the localization task. Most existing localization works directly use feature encoders pre-trained for video classification to extract video features, producing non-targeted features that lead to incomplete or over-complete action localization. We therefore propose the Generalized Contrastive Learning Network (GCLNet), which introduces two novel strategies to improve the pre-trained features. First, to address over-completeness, GCLNet introduces text information with good context independence and category separability to enrich the expression of video features, and proposes a novel generalized contrastive learning approach for similarity metrics that pulls features of the same category closer together while pushing those of different categories farther apart. This yields more compact intra-class feature learning and supports accurate action localization. Second, to tackle incompleteness, we exploit the respective strengths of RGB and Flow features in representing scene appearance and temporal motion, designing a hybrid attention strategy in GCLNet that mutually enhances the features of each stream. This process substantially improves the features by establishing cross-channel consensus. Finally, extensive experiments on THUMOS14 and ActivityNet1.2 show that our proposed GCLNet produces more representative action localization features.
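The abstract does not give the paper's exact loss formulation; as a minimal sketch of the pull-together/push-apart idea it describes, a generic supervised-contrastive loss over snippet features (function name, temperature value, and NumPy implementation are illustrative assumptions, not the authors' code) might look like:

```python
import numpy as np

def generalized_contrastive_loss(features, labels, temperature=0.1):
    """Generic supervised-contrastive sketch (not the paper's exact loss).

    Same-label features are pulled together and different-label features
    pushed apart via a log-softmax over pairwise cosine similarities.
    features: (N, D) array of snippet features; labels: length-N list.
    """
    # L2-normalize so dot products are cosine similarities
    feats = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = feats @ feats.T / temperature  # (N, N) similarity logits

    n = len(labels)
    loss = 0.0
    for i in range(n):
        # positives: other samples sharing sample i's category
        pos = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not pos:
            continue
        others = [j for j in range(n) if j != i]
        # log of the softmax denominator over all other samples
        log_denom = np.log(np.exp(sim[i, others]).sum())
        # average negative log-likelihood of the positive pairs
        loss += -np.mean([sim[i, j] - log_denom for j in pos])
    return loss / n
```

With aligned labels the loss is lower than with mismatched ones, which is the compactness property the abstract attributes to the generalized contrastive learning strategy.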
Journal description:
The IEEE Transactions on Big Data publishes peer-reviewed articles focusing on big data. These articles present innovative research ideas and application results across disciplines, including novel theories, algorithms, and applications. Research areas cover a wide range, such as big data analytics, visualization, curation, management, semantics, infrastructure, standards, performance analysis, intelligence extraction, scientific discovery, security, privacy, and legal issues specific to big data. The journal also prioritizes applications of big data in fields generating massive datasets.