Enhancing action discrimination via category-specific frame clustering for weakly-supervised temporal action localization

IF 2.7 · CAS Region 3 (Engineering Technology) · JCR Q2, COMPUTER SCIENCE, INFORMATION SYSTEMS
Huifen Xia, Yongzhao Zhan, Honglin Liu, Xiaopeng Ren
DOI: 10.1631/fitee.2300024
Journal: Frontiers of Information Technology & Electronic Engineering
Published: 2024-07-05 (Journal Article)
Citations: 0

Abstract

Temporal action localization (TAL) is the task of detecting the start and end timestamps of action instances in an untrimmed video and classifying them. As the number of action categories per video increases, existing weakly-supervised TAL (W-TAL) methods with only video-level labels cannot provide sufficient supervision, so single-frame supervision has attracted the interest of researchers. Existing paradigms model single-frame annotations from the perspective of video snippet sequences, neglecting the action discrimination of annotated frames and paying insufficient attention to their correlations within the same category. Within a category, the annotated frames exhibit distinctive appearance characteristics or clear action patterns. Thus, a novel method that enhances action discrimination via category-specific frame clustering for W-TAL is proposed. Specifically, the K-means clustering algorithm is employed to aggregate the annotated discriminative frames of the same category, which are regarded as exemplars exhibiting the characteristics of that action category. Class activation scores are then obtained by computing the similarities between a frame and the exemplars of each category. Category-specific representation modeling provides complementary guidance to the snippet sequence modeling in the mainline. Accordingly, a convex combination fusion mechanism is presented for annotated frames and snippet sequences to enhance the consistency of action discrimination, generating a robust class activation sequence for precise action classification and localization. Owing to this supplementary guidance, which enhances action discrimination for video snippet sequences, our method outperforms existing single-frame annotation-based methods. Experiments on three datasets (THUMOS14, GTEA, and BEOID) show that our method achieves high localization performance compared with state-of-the-art methods.
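The pipeline the abstract describes can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: all function names, feature dimensions, and hyperparameters (number of exemplars per class, fusion weight alpha) are assumptions. It clusters annotated frame features per category with K-means, treats the centroids as exemplars, scores a frame by its best cosine similarity to each category's exemplars, and fuses those scores with hypothetical snippet-sequence scores via a convex combination.

```python
# Illustrative sketch of category-specific frame clustering (hypothetical
# names and parameters; not the paper's code).
import numpy as np
from sklearn.cluster import KMeans

def build_exemplars(features_by_class, k=2, seed=0):
    """Cluster each category's annotated frame features; centroids serve as exemplars."""
    exemplars = {}
    for c, feats in features_by_class.items():
        k_c = min(k, len(feats))  # cannot have more clusters than frames
        km = KMeans(n_clusters=k_c, n_init=10, random_state=seed).fit(feats)
        exemplars[c] = km.cluster_centers_
    return exemplars

def class_activation_scores(frame, exemplars):
    """Score a frame per category by cosine similarity to its nearest exemplar."""
    f = frame / (np.linalg.norm(frame) + 1e-8)
    scores = {}
    for c, centers in exemplars.items():
        cn = centers / (np.linalg.norm(centers, axis=1, keepdims=True) + 1e-8)
        scores[c] = float(np.max(cn @ f))
    return scores

def fuse(exemplar_scores, snippet_scores, alpha=0.5):
    """Convex combination of exemplar-based and snippet-sequence scores."""
    return {c: alpha * exemplar_scores[c] + (1 - alpha) * snippet_scores[c]
            for c in exemplar_scores}

# Toy usage: two action categories in a 2-D feature space.
rng = np.random.default_rng(0)
feats = {"pour": rng.normal([1.0, 0.0], 0.1, (8, 2)),
         "stir": rng.normal([0.0, 1.0], 0.1, (8, 2))}
ex = build_exemplars(feats)
s = class_activation_scores(np.array([0.9, 0.1]), ex)       # frame near "pour"
fused = fuse(s, {"pour": 0.8, "stir": 0.2}, alpha=0.5)      # snippet scores are dummies
```

The convex combination (weights summing to 1) keeps the fused score inside the range of its two inputs, which is one simple way to realize the consistency-enhancing fusion the abstract mentions.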

Source journal

Frontiers of Information Technology & Electronic Engineering (COMPUTER SCIENCE, INFORMATION SYSTEMS; COMPUTER SCIENCE, SOFTWARE ENGINEERING)
CiteScore: 6.00
Self-citation rate: 10.00%
Annual publications: 1372
Journal description: Frontiers of Information Technology & Electronic Engineering (ISSN 2095-9184, monthly), formerly known as Journal of Zhejiang University SCIENCE C (Computers & Electronics) (2010-2014), is an international peer-reviewed journal launched by the Chinese Academy of Engineering (CAE) and Zhejiang University, and co-published by Springer and Zhejiang University Press. FITEE aims to publish the latest applications, principles, and algorithms in the broad area of electrical and electronic engineering, including but not limited to computer science, information sciences, control, automation, and telecommunications. Article types include research articles, review articles, science letters, perspectives, and new technical notes and methods.