Motion-transformer: self-supervised pre-training for skeleton-based action recognition

Proceedings of the 2nd ACM International Conference on Multimedia in Asia. Pub Date: 2021-03-07. DOI: 10.1145/3444685.3446289

Yi-Bin Cheng, Xipeng Chen, Dongyu Zhang, Liang Lin


Citations: 21

Abstract

With the development of deep learning, skeleton-based action recognition has made great progress in recent years. However, most current work focuses on extracting more informative spatial representations of the human body and does not make full use of the temporal dependencies already contained in human action sequences. To this end, we propose a novel transformer-based model, Motion-Transformer, that captures these temporal dependencies via self-supervised pre-training on human action sequences. In addition, we propose predicting the motion flow of human skeletons to better learn the temporal dependencies in a sequence. The pre-trained model is then fine-tuned on the action recognition task. Experimental results on the large-scale NTU RGB+D dataset show that our model is effective at modeling temporal relations and that flow-prediction pre-training helps expose the inherent dependencies in the time dimension. With this pre-training and fine-tuning paradigm, our final model outperforms previous state-of-the-art methods.
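
The abstract does not define "motion flow" precisely. As a rough illustration only, the sketch below assumes it means the frame-to-frame displacement of each joint, and shows how such a target could be derived from a skeleton sequence for a self-supervised prediction objective. The names `motion_flow` and `mse`, the mean-squared-error loss, and the toy coordinates are all hypothetical and are not taken from the paper.

```python
from typing import List

Joint = List[float]              # (x, y, z) coordinates of one joint
Frame = List[Joint]              # all joints of the skeleton at one time step
SkeletonSequence = List[Frame]   # frames ordered in time


def motion_flow(sequence: SkeletonSequence) -> SkeletonSequence:
    """Return the per-joint displacement between consecutive frames.

    Assumption: "motion flow" is read here as the temporal difference of
    joint coordinates; the abstract does not state the exact definition.
    A sequence of T frames yields T - 1 flow frames.
    """
    flow = []
    for prev, nxt in zip(sequence, sequence[1:]):
        flow.append([[b - a for a, b in zip(p, n)] for p, n in zip(prev, nxt)])
    return flow


def mse(pred: SkeletonSequence, target: SkeletonSequence) -> float:
    """Mean squared error over all frames, joints and coordinates.

    Assumption: MSE stands in for whatever pre-training loss the paper uses.
    """
    squared = [
        (p - t) ** 2
        for pred_frame, target_frame in zip(pred, target)
        for pred_joint, target_joint in zip(pred_frame, target_frame)
        for p, t in zip(pred_joint, target_joint)
    ]
    return sum(squared) / len(squared)


# Toy sequence: 3 frames, 2 joints, 3-D coordinates (hypothetical values).
toy = [
    [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]],
    [[0.5, 0.0, 0.0], [1.0, 2.0, 1.0]],
    [[1.5, 0.0, 0.0], [1.0, 2.0, 3.0]],
]

target = motion_flow(toy)

# A trivial "predict no motion" baseline that a pre-trained model should beat.
zero_pred = [[[0.0, 0.0, 0.0] for _ in frame] for frame in target]
baseline_loss = mse(zero_pred, target)
```

In a pre-training setup along these lines, a temporal model would receive the joint coordinates and be trained to regress `target`, so that the supervision signal comes from the sequence itself rather than from action labels.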
