{"title":"3-Stream Convolutional Networks for Video Action Recognition with Hybrid Motion Field","authors":"Wukui Yang, Shan Gao, Wenran Liu, Xiangyang Ji","doi":"10.1109/MMSP.2018.8547088","DOIUrl":null,"url":null,"abstract":"Two-stream based architectures for video action recognition exhibit great success recently. They encode the appearance with RGB frame, and the motion with optical flow. It is observed that optical flow depicts pixel-level motion field, focusing much on detail information, is hard to tackle the large displacement. In fact, human always focus the global motion rather than pixel-level motion. Inspired by this, we propose a novel 3-stream network structure with a spatial ConvNet, a pixel-level temporal ConvNet and a block-level temporal ConvNet. Integrating multi-granularity motion representation significantly outperforms single pixel-level motion field based architectures. Further, we can obtain the block-level motion vector field from compressed videos without extra calculation. We address missing and noisy motion patterns of motion vector field with intra-encoded block rectifying and flow guided filtering, building a hybrid motion field for our block-level temporal ConvNet. Our approach obtains state-of-the-art accuracy on UCF101 (95.27%) and HMDB 51 (69.21 %).","PeriodicalId":137522,"journal":{"name":"2018 IEEE 20th International Workshop on Multimedia Signal Processing (MMSP)","volume":"4 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2018 IEEE 20th International Workshop on Multimedia Signal Processing (MMSP)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/MMSP.2018.8547088","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 2
Abstract
Two-stream architectures have recently achieved great success in video action recognition. They encode appearance with RGB frames and motion with optical flow. However, optical flow depicts a pixel-level motion field that concentrates on fine detail and therefore struggles with large displacements. In fact, humans tend to attend to global motion rather than pixel-level motion. Inspired by this, we propose a novel 3-stream network structure consisting of a spatial ConvNet, a pixel-level temporal ConvNet, and a block-level temporal ConvNet. Integrating this multi-granularity motion representation significantly outperforms architectures based on a single pixel-level motion field. Furthermore, the block-level motion vector field can be obtained from compressed videos without extra computation. We address the missing and noisy motion patterns of the motion vector field with intra-encoded block rectifying and flow-guided filtering, building a hybrid motion field for our block-level temporal ConvNet. Our approach obtains state-of-the-art accuracy on UCF101 (95.27%) and HMDB51 (69.21%).
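To make the 3-stream idea concrete, the sketch below wires up three independent ConvNet streams (spatial RGB, pixel-level optical flow, block-level hybrid motion field) and combines their class scores with a weighted late fusion. This is a minimal illustrative sketch, not the authors' implementation: the tiny backbone, the fusion weights, the flow-stack depth, and the assumption that the hybrid motion field uses the same two-channels-per-frame layout as stacked optical flow are all assumptions made here for brevity.

```python
# Hypothetical sketch of a 3-stream late-fusion network (PyTorch).
# Not the paper's released code; backbone and fusion weights are placeholders.
import torch
import torch.nn as nn


def make_stream(in_channels: int, num_classes: int) -> nn.Sequential:
    """Tiny stand-in ConvNet; a real system would use a deep backbone per stream."""
    return nn.Sequential(
        nn.Conv2d(in_channels, 64, kernel_size=7, stride=2, padding=3),
        nn.ReLU(inplace=True),
        nn.AdaptiveAvgPool2d(1),
        nn.Flatten(),
        nn.Linear(64, num_classes),
    )


class ThreeStreamNet(nn.Module):
    def __init__(self, num_classes: int = 101, flow_stack: int = 10):
        super().__init__()
        # Spatial stream: a single RGB frame (3 channels).
        self.spatial = make_stream(3, num_classes)
        # Pixel-level temporal stream: stacked optical flow (2 channels per frame).
        self.pixel_temporal = make_stream(2 * flow_stack, num_classes)
        # Block-level temporal stream: stacked hybrid motion field
        # (assumed to share the optical-flow layout).
        self.block_temporal = make_stream(2 * flow_stack, num_classes)
        # Late-fusion weights over the three streams' scores (hypothetical values).
        self.weights = (1.0, 1.5, 1.0)

    def forward(self, rgb, flow, hybrid_mv):
        w_s, w_p, w_b = self.weights
        return (w_s * self.spatial(rgb)
                + w_p * self.pixel_temporal(flow)
                + w_b * self.block_temporal(hybrid_mv))


if __name__ == "__main__":
    net = ThreeStreamNet()
    rgb = torch.randn(2, 3, 224, 224)        # batch of RGB frames
    flow = torch.randn(2, 20, 224, 224)      # stacked optical flow
    hybrid = torch.randn(2, 20, 224, 224)    # stacked hybrid motion field
    print(net(rgb, flow, hybrid).shape)      # torch.Size([2, 101])
```

The key design point illustrated here is that the block-level stream is a drop-in third branch: because motion vectors come for free from the compressed bitstream, the extra stream adds little decoding cost while contributing a coarser, more global view of motion than pixel-level flow.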