A unified framework for unsupervised action learning via global-to-local motion transformer

IF 7.5 1区计算机科学 Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Pattern Recognition Pub Date : 2024-11-01 DOI:10.1016/j.patcog.2024.111118

Boeun Kim , Jungho Kim , Hyung Jin Chang , Tae-Hyun Oh

{"title":"A unified framework for unsupervised action learning via global-to-local motion transformer","authors":"Boeun Kim , Jungho Kim , Hyung Jin Chang , Tae-Hyun Oh","doi":"10.1016/j.patcog.2024.111118","DOIUrl":null,"url":null,"abstract":"<div><div>Human action recognition remains challenging due to the inherent complexity arising from the combination of diverse granularity of semantics, ranging from the local motion of body joints to high-level relationships across multiple people. To learn this multi-level characteristic of human action in an unsupervised manner, we propose a novel pretraining strategy along with a transformer-based model architecture named <em>GL-Transformer++</em>. Prior methods in unsupervised action recognition or unsupervised group activity recognition (GAR) have shown limitations, often focusing solely on capturing a partial scope of the action, such as the local movements of each individual or the broader context of the overall motion. To tackle this problem, we introduce a novel pretraining strategy named <em>multi-interval pose displacement prediction (MPDP)</em> that enables the model to learn the diverse extents of the action. In the architectural aspect, we incorporate the <em>global and local attention (GLA)</em> mechanism within the transformer blocks to learn local dynamics between joints, global context of each individual, as well as high-level interpersonal relationships in both spatial and temporal manner. In fact, the proposed method is a unified approach that demonstrates efficacy in both action recognition and GAR. Particularly, our method presents a new and strong baseline, surpassing the current SOTA GAR method by significant margins: 29.6% in Volleyball and 60.3% and 59.9% on the xsub and xset settings of the Mutual NTU dataset, respectively.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"159 ","pages":"Article 111118"},"PeriodicalIF":7.5000,"publicationDate":"2024-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Pattern Recognition","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0031320324008690","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}

引用次数: 0

Abstract

Human action recognition remains challenging due to the inherent complexity arising from the combination of diverse granularity of semantics, ranging from the local motion of body joints to high-level relationships across multiple people. To learn this multi-level characteristic of human action in an unsupervised manner, we propose a novel pretraining strategy along with a transformer-based model architecture named GL-Transformer++. Prior methods in unsupervised action recognition or unsupervised group activity recognition (GAR) have shown limitations, often focusing solely on capturing a partial scope of the action, such as the local movements of each individual or the broader context of the overall motion. To tackle this problem, we introduce a novel pretraining strategy named multi-interval pose displacement prediction (MPDP) that enables the model to learn the diverse extents of the action. In the architectural aspect, we incorporate the global and local attention (GLA) mechanism within the transformer blocks to learn local dynamics between joints, global context of each individual, as well as high-level interpersonal relationships in both spatial and temporal manner. In fact, the proposed method is a unified approach that demonstrates efficacy in both action recognition and GAR. Particularly, our method presents a new and strong baseline, surpassing the current SOTA GAR method by significant margins: 29.6% in Volleyball and 60.3% and 59.9% on the xsub and xset settings of the Mutual NTU dataset, respectively.

查看原文本刊更多论文

通过全局到局部运动变换器实现无监督动作学习的统一框架

从身体关节的局部运动到多人之间的高层次关系，各种语义粒度的组合产生了固有的复杂性，因此人类动作识别仍然具有挑战性。为了在无监督的情况下学习人类动作的这种多层次特征，我们提出了一种新颖的预训练策略以及一种基于转换器的模型架构，命名为 GL-Transformer++。之前的无监督动作识别或无监督群体活动识别（GAR）方法存在局限性，通常只能捕捉动作的部分范围，如每个人的局部动作或整体动作的大背景。为了解决这个问题，我们引入了一种名为多区间姿势位移预测（MPDP）的新型预训练策略，使模型能够学习动作的不同范围。在架构方面，我们将全局和局部注意力（GLA）机制纳入变压器模块，以学习关节间的局部动态、每个个体的全局上下文以及高层次的空间和时间人际关系。事实上，所提出的方法是一种统一的方法，在动作识别和 GAR 方面都显示出了功效。特别是，我们的方法提出了一个新的、强大的基线，大大超过了目前的 SOTA GAR 方法：在排球比赛中超过了 29.6%，在 Mutual NTU 数据集的 xsub 和 xset 设置中分别超过了 60.3% 和 59.9%。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

Pattern Recognition 工程技术-工程：电子与电气

CiteScore

14.40

自引率

16.20%

发文量

683

审稿时长

5.6 months

期刊介绍： The field of Pattern Recognition is both mature and rapidly evolving, playing a crucial role in various related fields such as computer vision, image processing, text analysis, and neural networks. It closely intersects with machine learning and is being applied in emerging areas like biometrics, bioinformatics, multimedia data analysis, and data science. The journal Pattern Recognition, established half a century ago during the early days of computer science, has since grown significantly in scope and influence.