{"title":"Unsupervised Action Anticipation Through Action Cluster Prediction","authors":"Jiuxu Chen;Nupur Thakur;Sachin Chhabra;Baoxin Li","doi":"10.1109/OJSP.2025.3578300","DOIUrl":null,"url":null,"abstract":"Predicting near-future human actions in videos has become a focal point of research, driven by applications such as human-helping robotics, collaborative AI services, and surveillance video analysis. However, the inherent challenge lies in deciphering the complex spatial-temporal dynamics inherent in typical video feeds. While existing works excel in constrained settings with fine-grained action ground-truth labels, the general unavailability of such labeling at the frame level poses a significant hurdle. In this paper, we present an innovative solution to anticipate future human actions without relying on any form of supervision. Our approach involves generating pseudo-labels for video frames through the clustering of frame-wise visual features. These pseudo-labels are then input into a temporal sequence modeling module that learns to predict future actions in terms of pseudo-labels. Apart from the action anticipation method, we propose an innovative evaluation scheme GreedyMapper, a unique many-to-one mapping scheme that provides a practical solution to the many-to-one mapping challenge, a task that existing mapping algorithms struggle to address. Through comprehensive experimentation conducted on demanding real-world cooking datasets, our unsupervised method demonstrates superior performance compared to weakly-supervised approaches by a significant margin on the 50Salads dataset. When applied to the Breakfast dataset, our approach yields strong performance compared to the baselines in an unsupervised setting and delivers competitive results to (weakly) supervised methods under a similar setting.","PeriodicalId":73300,"journal":{"name":"IEEE open journal of signal processing","volume":"6 ","pages":"641-650"},"PeriodicalIF":2.9000,"publicationDate":"2025-06-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11029147","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE open journal of signal processing","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/11029147/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
Abstract
Predicting near-future human actions in videos has become a focal point of research, driven by applications such as assistive robotics, collaborative AI services, and surveillance video analysis. The central challenge lies in deciphering the complex spatial-temporal dynamics of typical video feeds. While existing works excel in constrained settings with fine-grained action ground-truth labels, such frame-level labeling is generally unavailable, which poses a significant hurdle. In this paper, we present a solution that anticipates future human actions without relying on any form of supervision. Our approach generates pseudo-labels for video frames by clustering frame-wise visual features; these pseudo-labels are then fed into a temporal sequence modeling module that learns to predict future actions in terms of pseudo-labels. Beyond the anticipation method itself, we propose GreedyMapper, an evaluation scheme that provides a practical solution to the many-to-one mapping challenge, a task that existing mapping algorithms struggle to address. In comprehensive experiments on demanding real-world cooking datasets, our unsupervised method outperforms weakly-supervised approaches by a significant margin on the 50Salads dataset. On the Breakfast dataset, our approach performs strongly against unsupervised baselines and delivers results competitive with (weakly) supervised methods under a similar setting.
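To make the pseudo-labeling step concrete, here is a minimal sketch in Python: frame-wise visual features are clustered (k-means here, as one common choice) and each frame receives its cluster id as a pseudo-label. The feature dimensionality, cluster count, and random features are illustrative placeholders, not the paper's configuration.

```python
# Hypothetical sketch of the pseudo-labeling stage: cluster frame-wise
# visual features so that each frame receives a cluster id (its pseudo-label).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
frame_features = rng.normal(size=(500, 64))  # stand-in for per-frame visual features

n_clusters = 19  # illustrative: roughly the number of action classes expected
kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
pseudo_labels = kmeans.fit_predict(frame_features)  # one cluster id per frame
print(pseudo_labels[:20])
```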
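The temporal sequence modeling module could, for illustration, be realized as a small recurrent network that reads the observed pseudo-label sequence and predicts upcoming pseudo-labels. The GRU architecture and the `PseudoLabelForecaster` name below are assumptions made for this sketch, not the paper's actual model.

```python
import torch
import torch.nn as nn

class PseudoLabelForecaster(nn.Module):
    """Hypothetical stand-in for the temporal sequence modeling module:
    a GRU that embeds observed pseudo-labels and emits, at each step,
    logits over the pseudo-label of the next frame."""
    def __init__(self, n_clusters, emb_dim=32, hidden=64):
        super().__init__()
        self.embed = nn.Embedding(n_clusters, emb_dim)
        self.gru = nn.GRU(emb_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_clusters)

    def forward(self, seq):            # seq: (batch, time) of cluster ids
        h, _ = self.gru(self.embed(seq))
        return self.head(h)            # (batch, time, n_clusters) logits

model = PseudoLabelForecaster(n_clusters=19)
obs = torch.randint(0, 19, (1, 100))   # 100 observed frames' pseudo-labels
logits = model(obs)                    # (1, 100, 19)
future = logits[:, -1].argmax(-1)      # predicted pseudo-label after the window
```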
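GreedyMapper is described as a many-to-one mapping scheme for evaluation; one plausible reading, sketched below, greedily assigns each predicted cluster to the ground-truth class it overlaps most, so that several clusters may share one class. The `greedy_many_to_one` function is a hypothetical reconstruction for illustration, not the paper's exact algorithm.

```python
import numpy as np

def greedy_many_to_one(pred, gt, n_clusters, n_classes):
    """Map each predicted cluster id to the ground-truth class it overlaps
    most with (several clusters may map to the same class), then score
    accuracy under that mapping. Hypothetical reconstruction, not the
    paper's exact GreedyMapper."""
    # Cluster-vs-class co-occurrence (contingency) counts.
    contingency = np.zeros((n_clusters, n_classes), dtype=int)
    for c, y in zip(pred, gt):
        contingency[c, y] += 1
    mapping = contingency.argmax(axis=1)  # greedy per-cluster choice
    mapped = mapping[pred]                # relabel predictions via the mapping
    return mapping, (mapped == gt).mean()

# Toy usage: clusters 0 and 1 both map to class 0 -> many-to-one.
pred = np.array([0, 0, 1, 1, 2, 2, 2])
gt   = np.array([0, 0, 0, 0, 1, 1, 1])
mapping, acc = greedy_many_to_one(pred, gt, n_clusters=3, n_classes=2)
print(mapping, acc)  # [0 0 1] 1.0
```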