{"title":"A Novel View Attention Network for Skeleton based Human Action Recognition*","authors":"Shaocheng Li, Zhen-yu Liu, Jianrong Tan","doi":"10.1109/CEECT53198.2021.9672614","DOIUrl":null,"url":null,"abstract":"Skeleton based human action recognition is becoming increasingly popular nowadays thanks to the development of low-cost depth sensors and pose estimation techniques. To enrich the expression of human skeletal characteristics and enhance generalization ability of models, new approaches are proposed to utilize human skeletons observed by multiple viewpoints for feature extracting. Despite the significant progress on these multi-view skeletons based approaches, the intrinsic correlation among views and fusion form of features have not been extensively investigated. In order to tackle these problems, we proposed a novel View Attention Network (VANet), which can learn the relationship of different views and fuse the multi-view features effectively. First, the spatio-temporal dynamics of human skeletons are encoded in the multi-view skeletal arrays. Then, a multi-branch Convolutional Neural Network (CNN) is adopted for extracting features from multiple views. Moreover, we design a view attention module to capture the correlation across different views. Particularly, we expend the module to a multi-head format to increase the feature spaces and enhance the robustness of entire network. Finally, an aggregated feature is learned from the module for final recognition. Extensive experiments on public NTU RGB+D 60 and SBU Kinect Interaction datasets show that our approach can achieve state-of-the-art results.","PeriodicalId":153030,"journal":{"name":"2021 3rd International Conference on Electrical Engineering and Control Technologies (CEECT)","volume":"32 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 3rd International Conference on Electrical Engineering and Control Technologies (CEECT)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CEECT53198.2021.9672614","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Skeleton-based human action recognition has become increasingly popular thanks to the development of low-cost depth sensors and pose estimation techniques. To enrich the expression of human skeletal characteristics and enhance the generalization ability of models, new approaches have been proposed that utilize human skeletons observed from multiple viewpoints for feature extraction. Despite the significant progress of these multi-view skeleton-based approaches, the intrinsic correlation among views and the form of feature fusion have not been extensively investigated. To tackle these problems, we propose a novel View Attention Network (VANet), which can learn the relationships among different views and fuse multi-view features effectively. First, the spatio-temporal dynamics of human skeletons are encoded in multi-view skeletal arrays. Then, a multi-branch Convolutional Neural Network (CNN) is adopted to extract features from the multiple views. Moreover, we design a view attention module to capture the correlation across different views. In particular, we extend the module to a multi-head format to enlarge the feature spaces and enhance the robustness of the entire network. Finally, an aggregated feature is learned from the module for final recognition. Extensive experiments on the public NTU RGB+D 60 and SBU Kinect Interaction datasets show that our approach achieves state-of-the-art results.
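The abstract does not give the paper's exact layer sizes or fusion equations, so the following is only a minimal PyTorch sketch of the multi-head view attention idea it describes: per-view feature vectors (e.g., the outputs of the multi-branch CNN) attend to one another across the view axis, and the result is aggregated into a single feature for recognition. All names and dimensions here (MultiHeadViewAttention, num_views, feat_dim, the mean-pooling aggregation) are hypothetical illustrations, not the authors' implementation.

```python
# Hypothetical sketch of a multi-head view attention module; the actual
# VANet architecture may differ in projections, fusion, and aggregation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadViewAttention(nn.Module):
    """Fuses per-view feature vectors with attention over the view axis."""

    def __init__(self, feat_dim: int = 256, num_heads: int = 4):
        super().__init__()
        assert feat_dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = feat_dim // num_heads
        # Linear maps producing queries, keys, and values for each view.
        self.q_proj = nn.Linear(feat_dim, feat_dim)
        self.k_proj = nn.Linear(feat_dim, feat_dim)
        self.v_proj = nn.Linear(feat_dim, feat_dim)
        self.out_proj = nn.Linear(feat_dim, feat_dim)

    def forward(self, view_feats: torch.Tensor) -> torch.Tensor:
        # view_feats: (batch, num_views, feat_dim), one vector per viewpoint.
        b, v, d = view_feats.shape

        def split_heads(x: torch.Tensor) -> torch.Tensor:
            # (batch, views, feat_dim) -> (batch, heads, views, head_dim)
            return x.view(b, v, self.num_heads, self.head_dim).transpose(1, 2)

        q = split_heads(self.q_proj(view_feats))
        k = split_heads(self.k_proj(view_feats))
        val = split_heads(self.v_proj(view_feats))

        # Attention weights over the view axis capture inter-view correlation.
        attn = F.softmax(q @ k.transpose(-2, -1) / self.head_dim ** 0.5, dim=-1)
        fused = (attn @ val).transpose(1, 2).reshape(b, v, d)
        fused = self.out_proj(fused)
        # Aggregate across views into one feature vector for classification.
        return fused.mean(dim=1)

if __name__ == "__main__":
    # Three synthetic viewpoints for a batch of 8 skeleton clips.
    feats = torch.randn(8, 3, 256)  # e.g., per-view CNN branch outputs
    fused = MultiHeadViewAttention()(feats)
    print(fused.shape)  # torch.Size([8, 256])
```

Splitting the projections into several heads, as the abstract motivates, lets each head weight the views in a different learned subspace, which is the sense in which the multi-head format enlarges the feature spaces available for fusion.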