An effective fusion scheme of spatio-temporal features for human action recognition in RGB-D video
Quang D. Tran, N. Ly
2013 International Conference on Control, Automation and Information Sciences (ICCAIS), published 2013-11-01
DOI: 10.1109/ICCAIS.2013.6720562 (https://doi.org/10.1109/ICCAIS.2013.6720562)
Citations: 7
Abstract
We investigate human action recognition by studying the effect of fusing feature streams extracted from color and depth sequences. Our contribution is two-fold. First, we present the 3DS-HONV descriptor, a spatio-temporal extension of the Histogram of Oriented Normal Vectors (HONV) designed to capture joint shape-motion cues from depth sequences. Second, we develop an RGB-D feature fusion scheme that exploits information from both color and depth channels to extract expressive representations for action recognition in real scenarios. Despite its simplicity, the 3DS-HONV descriptor performs well and achieves state-of-the-art performance on the MSRAction3D dataset, with an overall accuracy of 88.89%. Further experiments show that the fusion scheme also generalizes well and achieves good results on the one-shot-learning ChaLearn Gesture Data (CGD2011).
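
To make the idea behind HONV concrete, the following is a minimal sketch of a per-frame histogram of oriented normal vectors, not the authors' implementation. It assumes the common formulation in which the surface normal of a depth map z(x, y) is proportional to (-dz/dx, -dz/dy, 1) and is described by an azimuth and a zenith angle. The function name `honv_histogram`, the list-of-rows input format, and the bin counts are hypothetical choices for illustration. The spatio-temporal extension (3DS-HONV) and the RGB-D fusion step are not reproduced here, since the abstract does not specify them.

```python
import math


def honv_histogram(depth, phi_bins=8, theta_bins=4):
    """Normalized histogram of surface-normal angles for one depth frame.

    depth: list of equal-length rows of depth values (assumed input format).
    Returns a flat list of phi_bins * theta_bins values that sum to 1.
    """
    rows, cols = len(depth), len(depth[0])
    hist = [[0.0] * theta_bins for _ in range(phi_bins)]
    count = 0
    for y in range(1, rows - 1):
        for x in range(1, cols - 1):
            # Central differences approximate the depth gradient.
            dzdx = (depth[y][x + 1] - depth[y][x - 1]) / 2.0
            dzdy = (depth[y + 1][x] - depth[y - 1][x]) / 2.0
            # Angles of the normal, which is proportional to (-dzdx, -dzdy, 1).
            phi = math.atan2(dzdy, dzdx)               # azimuth in [-pi, pi]
            theta = math.atan(math.hypot(dzdx, dzdy))  # zenith in [0, pi/2)
            # Quantize both angles; clamp so the upper edge falls in the last bin.
            p = min(int((phi + math.pi) / (2 * math.pi) * phi_bins), phi_bins - 1)
            t = min(int(theta / (math.pi / 2) * theta_bins), theta_bins - 1)
            hist[p][t] += 1.0
            count += 1
    flat = [v for row in hist for v in row]
    return [v / count for v in flat] if count else flat
```

Calling `honv_histogram(frame)` on each depth frame yields a fixed-length vector per frame. How such per-frame vectors would be extended over time or combined with color features is a design decision of the paper itself and is left open in this sketch.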