Hierarchical Spatial-Temporal Masked Contrast for Skeleton Action Recognition

Wenming Cao;Aoyu Zhang;Zhihai He;Yicha Zhang;Xinpeng Yin
Published in: IEEE Transactions on Artificial Intelligence, vol. 5, no. 11, pp. 5801-5814
Publication date: 2024-07-17
DOI: 10.1109/TAI.2024.3430260
URL: https://ieeexplore.ieee.org/document/10601523/

Abstract

In the field of 3-D action recognition, self-supervised learning has shown promising results but remains a challenging task. Previous approaches to motion modeling often relied on selecting features solely from the temporal or spatial domain, which limited the extraction of higher-level semantic information. Additionally, traditional one-to-one approaches in multilevel comparative learning overlooked the relationships between different levels, hindering the learning representation of the model. To address these issues, we propose the hierarchical spatial-temporal masked network (HSTM) for learning 3-D action representations. HSTM introduces a novel masking method that operates simultaneously in both the temporal and spatial dimensions. This approach leverages semantic relevance to identify meaningful regions in time and space, guiding the masking process based on semantic richness. This guidance is crucial for learning useful feature representations effectively. Furthermore, to enhance the learning of potential features, we introduce cross-level distillation (CLD) to extend the comparative learning approach. By training the model with two types of losses simultaneously, each level of the multilevel comparative learning process can be guided by levels rich in semantic information. This allows for more effective supervision of comparative learning, leading to improved performance. Extensive experiments conducted on the NTU-60, NTU-120, and PKU-MMD datasets demonstrate the effectiveness of our proposed framework. The learned action representations exhibit strong transferability and achieve state-of-the-art results.
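
The abstract describes two ideas only at a high level: masking that acts on time and space together, guided by semantic richness, and a cross-level distillation term trained alongside the contrastive loss. The following is a minimal sketch of how such components could look, not the authors' implementation. It assumes per-joint motion magnitude as a stand-in for semantic relevance (the abstract does not say how relevance is computed) and a KL divergence between similarity distributions as the distillation term. All function names and parameters (`motion_saliency`, `joint_st_mask`, `cross_level_kl`, `mask_ratio`, `tau`) are hypothetical.

```python
import math


def motion_saliency(seq):
    """Score each (frame, joint) cell by how far the joint moved since the
    previous frame. ASSUMPTION: motion magnitude is only a proxy for the
    'semantic richness' mentioned in the abstract. seq[t][j] is (x, y, z)."""
    T, J = len(seq), len(seq[0])
    scores = [[0.0] * J for _ in range(T)]  # frame 0 has no predecessor
    for t in range(1, T):
        for j in range(J):
            scores[t][j] = math.dist(seq[t][j], seq[t - 1][j])
    return scores


def joint_st_mask(seq, mask_ratio=0.25):
    """Mask the highest-scoring cells, ranked over time and space jointly
    rather than per frame or per joint. Masked cells are zeroed."""
    T, J = len(seq), len(seq[0])
    scores = motion_saliency(seq)
    cells = [(scores[t][j], t, j) for t in range(T) for j in range(J)]
    cells.sort(key=lambda c: c[0], reverse=True)  # stable for tied scores
    k = int(round(mask_ratio * T * J))
    masked = {(t, j) for _, t, j in cells[:k]}
    out = [
        [(0.0, 0.0, 0.0) if (t, j) in masked else seq[t][j] for j in range(J)]
        for t in range(T)
    ]
    return out, masked


def softmax(xs, tau):
    """Temperature-scaled softmax, shifted by the max for stability."""
    m = max(xs)
    exps = [math.exp((x - m) / tau) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_level_kl(teacher_sims, student_sims, tau=0.1):
    """ASSUMED distillation term: KL(teacher || student) between the
    similarity distributions of a semantically richer level (teacher) and
    another level (student). Inputs are cosine similarities in [-1, 1]
    of one query against the same set of keys."""
    p = softmax(teacher_sims, tau)
    q = softmax(student_sims, tau)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```

In a training loop, a term like `cross_level_kl` would be added to the per-level contrastive loss, which is one reading of "training the model with two types of losses simultaneously". The weighting between the two terms is not stated in the abstract and is left open here.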