Skeleton-Indexed Deep Multi-Modal Feature Learning for High Performance Human Action Recognition
Sijie Song, Cuiling Lan, Junliang Xing, Wenjun Zeng, Jiaying Liu
2018 IEEE International Conference on Multimedia and Expo (ICME), July 2018. DOI: 10.1109/ICME.2018.8486486
Citations: 22
Abstract
This paper presents a new framework for action recognition with multi-modal data. A skeleton-indexed feature learning procedure is developed to further exploit detailed local features from RGB and optical flow videos. In particular, the proposed framework is built on a deep Convolutional Network (ConvNet) and a Recurrent Neural Network (RNN) with Long Short-Term Memory (LSTM). A skeleton-indexed transform layer is designed to automatically extract visual features around key joints, and part-aggregated pooling is developed to uniformly regulate the visual features from different body parts and actors. In addition, several fusion schemes are explored to take advantage of multi-modal data. The proposed deep architecture is end-to-end trainable and can better incorporate different modalities to learn effective feature representations. Quantitative experimental results on two datasets, the NTU RGB+D dataset and the MSR dataset, demonstrate the superior performance of our scheme over other state-of-the-art methods. To our knowledge, the proposed framework currently achieves the best performance on the challenging NTU RGB+D dataset.
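To make the two components named in the abstract more concrete, the following is a minimal, illustrative sketch (not the authors' implementation) of (1) a skeleton-indexed transform that samples ConvNet feature-map activations at 2D joint locations, and (2) part-aggregated pooling that pools joint features within predefined body parts before passing a per-frame vector to an LSTM. The function names, tensor shapes, and the body-part grouping are assumptions made for illustration only.

```python
# Hedged sketch of skeleton-indexed feature extraction and part-aggregated
# pooling, written in PyTorch. Shapes and joint groupings are placeholders.
import torch
import torch.nn.functional as F


def skeleton_indexed_features(feature_map, joints_xy):
    """Sample ConvNet features at normalized joint coordinates.

    feature_map: (N, C, H, W) activations from an RGB or optical-flow ConvNet.
    joints_xy:   (N, J, 2) joint coordinates in [-1, 1] as (x, y), e.g. mapped
                 from image-space skeleton joints.
    returns:     (N, J, C), one feature vector per joint.
    """
    # grid_sample expects a (N, H_out, W_out, 2) sampling grid; treat the J
    # joints as a 1 x J grid and bilinearly interpolate the feature map.
    grid = joints_xy.unsqueeze(1)                                     # (N, 1, J, 2)
    sampled = F.grid_sample(feature_map, grid, align_corners=False)   # (N, C, 1, J)
    return sampled.squeeze(2).permute(0, 2, 1)                        # (N, J, C)


# Hypothetical grouping of 16 joints into body parts (indices are placeholders).
BODY_PARTS = {
    "torso": [0, 1, 2, 3],
    "left_arm": [4, 5, 6],
    "right_arm": [7, 8, 9],
    "left_leg": [10, 11, 12],
    "right_leg": [13, 14, 15],
}


def part_aggregated_pooling(joint_feats):
    """Max-pool joint features within each body part, then concatenate.

    joint_feats: (N, J, C) -> (N, P * C), with P the number of body parts.
    """
    pooled = [joint_feats[:, idx, :].max(dim=1).values for idx in BODY_PARTS.values()]
    return torch.cat(pooled, dim=1)


# Per-frame usage: the resulting (N, P*C) vector would be fed, frame by frame,
# into an LSTM for temporal modeling, as described in the abstract.
feats = skeleton_indexed_features(torch.randn(2, 64, 28, 28),
                                  torch.rand(2, 16, 2) * 2 - 1)
frame_feature = part_aggregated_pooling(feats)   # shape: (2, 5 * 64)
```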