{"title":"A Novel View Attention Network for Skeleton based Human Action Recognition*","authors":"Shaocheng Li, Zhen-yu Liu, Jianrong Tan","doi":"10.1109/CEECT53198.2021.9672614","DOIUrl":null,"url":null,"abstract":"Skeleton based human action recognition is becoming increasingly popular nowadays thanks to the development of low-cost depth sensors and pose estimation techniques. To enrich the expression of human skeletal characteristics and enhance generalization ability of models, new approaches are proposed to utilize human skeletons observed by multiple viewpoints for feature extracting. Despite the significant progress on these multi-view skeletons based approaches, the intrinsic correlation among views and fusion form of features have not been extensively investigated. In order to tackle these problems, we proposed a novel View Attention Network (VANet), which can learn the relationship of different views and fuse the multi-view features effectively. First, the spatio-temporal dynamics of human skeletons are encoded in the multi-view skeletal arrays. Then, a multi-branch Convolutional Neural Network (CNN) is adopted for extracting features from multiple views. Moreover, we design a view attention module to capture the correlation across different views. Particularly, we expend the module to a multi-head format to increase the feature spaces and enhance the robustness of entire network. Finally, an aggregated feature is learned from the module for final recognition. Extensive experiments on public NTU RGB+D 60 and SBU Kinect Interaction datasets show that our approach can achieve state-of-the-art results.","PeriodicalId":153030,"journal":{"name":"2021 3rd International Conference on Electrical Engineering and Control Technologies (CEECT)","volume":"32 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 3rd International Conference on Electrical Engineering and Control Technologies (CEECT)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CEECT53198.2021.9672614","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Skeleton-based human action recognition has become increasingly popular thanks to the development of low-cost depth sensors and pose estimation techniques. To enrich the expression of human skeletal characteristics and enhance the generalization ability of models, new approaches have been proposed that utilize human skeletons observed from multiple viewpoints for feature extraction. Despite the significant progress of these multi-view skeleton-based approaches, the intrinsic correlation among views and the form of feature fusion have not been extensively investigated. To tackle these problems, we propose a novel View Attention Network (VANet), which can learn the relationships among different views and fuse multi-view features effectively. First, the spatio-temporal dynamics of human skeletons are encoded in multi-view skeletal arrays. Then, a multi-branch Convolutional Neural Network (CNN) is adopted to extract features from the multiple views. Moreover, we design a view attention module to capture the correlation across different views. In particular, we extend the module to a multi-head format to enlarge the feature spaces and enhance the robustness of the entire network. Finally, an aggregated feature is learned from the module for final recognition. Extensive experiments on the public NTU RGB+D 60 and SBU Kinect Interaction datasets show that our approach achieves state-of-the-art results.
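The abstract does not give the paper's exact layer sizes or fusion equations, so the following is only a minimal PyTorch sketch of the multi-head view attention idea it describes: per-view feature vectors (e.g., the outputs of the multi-branch CNN) attend to one another across the view axis, and the result is aggregated into a single feature for recognition. All names and dimensions here (MultiHeadViewAttention, num_views, feat_dim, the mean-pooling aggregation) are hypothetical illustrations, not the authors' implementation.

```python
# Hypothetical sketch of a multi-head view attention module; the actual
# VANet architecture may differ in projections, fusion, and aggregation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadViewAttention(nn.Module):
    """Fuses per-view feature vectors with attention over the view axis."""

    def __init__(self, feat_dim: int = 256, num_heads: int = 4):
        super().__init__()
        assert feat_dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = feat_dim // num_heads
        # Linear maps producing queries, keys, and values for each view.
        self.q_proj = nn.Linear(feat_dim, feat_dim)
        self.k_proj = nn.Linear(feat_dim, feat_dim)
        self.v_proj = nn.Linear(feat_dim, feat_dim)
        self.out_proj = nn.Linear(feat_dim, feat_dim)

    def forward(self, view_feats: torch.Tensor) -> torch.Tensor:
        # view_feats: (batch, num_views, feat_dim), one vector per viewpoint.
        b, v, d = view_feats.shape

        def split_heads(x: torch.Tensor) -> torch.Tensor:
            # (batch, views, feat_dim) -> (batch, heads, views, head_dim)
            return x.view(b, v, self.num_heads, self.head_dim).transpose(1, 2)

        q = split_heads(self.q_proj(view_feats))
        k = split_heads(self.k_proj(view_feats))
        val = split_heads(self.v_proj(view_feats))

        # Attention weights over the view axis capture inter-view correlation.
        attn = F.softmax(q @ k.transpose(-2, -1) / self.head_dim ** 0.5, dim=-1)
        fused = (attn @ val).transpose(1, 2).reshape(b, v, d)
        fused = self.out_proj(fused)
        # Aggregate across views into one feature vector for classification.
        return fused.mean(dim=1)

if __name__ == "__main__":
    # Three synthetic viewpoints for a batch of 8 skeleton clips.
    feats = torch.randn(8, 3, 256)  # e.g., per-view CNN branch outputs
    fused = MultiHeadViewAttention()(feats)
    print(fused.shape)  # torch.Size([8, 256])
```

Splitting the projections into several heads, as the abstract motivates, lets each head weight the views in a different learned subspace, which is the sense in which the multi-head format enlarges the feature spaces available for fusion.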