Cross-Model Cross-Stream Learning for Self-Supervised Human Action Recognition

IF 3.5 | Zone 3, Computer Science | JCR Q2, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

IEEE Transactions on Human-Machine Systems Pub Date : 2024-10-17 DOI:10.1109/THMS.2024.3467334

Mengyuan Liu;Hong Liu;Tianyu Guo

Volume 54, Issue 6, Pages 743-752. Publication type: Journal Article. https://ieeexplore.ieee.org/document/10721222/

Citations: 0

Abstract

Because of their instance-level discriminative ability, contrastive learning methods such as MoCo and SimCLR have been adapted from image representation learning to self-supervised skeleton-based action recognition. These methods usually use multiple data streams (i.e., joint, motion, and bone) for ensemble learning. However, how to construct a discriminative feature space within a single stream and how to effectively aggregate information from multiple streams remain open problems. To this end, this article first applies a new contrastive learning method called bootstrap your own latent (BYOL) to skeleton data and formulates SkeletonBYOL as a simple yet effective baseline for self-supervised skeleton-based action recognition. Inspired by SkeletonBYOL, the article further presents a cross-model and cross-stream (CMCS) framework that combines cross-model adversarial learning (CMAL) and cross-stream collaborative learning (CSCL). Specifically, CMAL learns single-stream representations with a cross-model adversarial loss to obtain more discriminative features. To aggregate multistream information and let the streams interact, CSCL generates similarity pseudolabels from ensemble learning and uses them as supervision to guide feature generation for individual streams. Extensive experiments on three datasets verify that CMAL and CSCL are complementary and that the proposed method achieves better results than state-of-the-art methods under various evaluation protocols.
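As a rough illustration of two ingredients named in the abstract, the sketch below derives joint, motion, and bone streams from a skeleton sequence and computes the standard BYOL regression loss with an exponential-moving-average target update. It is not the authors' implementation: the 5-joint parent table, the (T, V, C) array layout, the zero-padded last motion frame, and the function names build_streams, byol_loss, and ema_update are all assumptions made for this example, and the CMAL and CSCL losses are not shown because the abstract does not specify them.

```python
import numpy as np

# Hypothetical 5-joint toy skeleton: PARENTS[j] is the parent of joint j
# (the root points at itself). Real datasets define their own joint trees.
PARENTS = np.array([0, 0, 1, 2, 3])


def build_streams(joints, parents=PARENTS):
    """Derive joint, motion, and bone streams from joints of shape (T, V, C)."""
    motion = np.zeros_like(joints)
    motion[:-1] = joints[1:] - joints[:-1]  # frame-to-frame displacement
    bone = joints - joints[:, parents]      # vector from parent joint to child joint
    return {"joint": joints, "motion": motion, "bone": bone}


def byol_loss(online_pred, target_proj):
    """BYOL regression loss: 2 - 2 * cosine similarity, averaged over the batch."""
    p = online_pred / np.linalg.norm(online_pred, axis=-1, keepdims=True)
    z = target_proj / np.linalg.norm(target_proj, axis=-1, keepdims=True)
    return float(np.mean(2.0 - 2.0 * np.sum(p * z, axis=-1)))


def ema_update(target_w, online_w, tau=0.99):
    """Move target weights toward online weights by exponential moving average."""
    return tau * target_w + (1.0 - tau) * online_w
```

In a full BYOL setup, online_pred would come from the online network's predictor on one augmented view and target_proj from the target network's projector on another view, with no gradient flowing through the target branch. Each stream would typically get its own encoder before any cross-stream aggregation.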




Source journal

IEEE Transactions on Human-Machine Systems COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE-COMPUTER SCIENCE, CYBERNETICS

CiteScore

7.10

Self-citation rate

11.10%

Articles published

136

About the journal: The scope of the IEEE Transactions on Human-Machine Systems includes the fields of human-machine systems. It covers human systems and human organizational interactions, including cognitive ergonomics, system test and evaluation, and human information processing concerns in systems and organizations.