{"title":"Recognize Actions by Disentangling Components of Dynamics","authors":"Yue Zhao, Yuanjun Xiong, Dahua Lin","doi":"10.1109/CVPR.2018.00687","DOIUrl":null,"url":null,"abstract":"Despite the remarkable progress in action recognition over the past several years, existing methods remain limited in efficiency and effectiveness. The methods treating appearance and motion as separate streams are usually subject to the cost of optical flow computation, while those relying on 3D convolution on the original video frames often yield inferior performance in practice. In this paper, we propose a new ConvNet architecture for video representation learning, which can derive disentangled components of dynamics purely from raw video frames, without the need of optical flow estimation. Particularly, the learned representation comprises three components for representing static appearance, apparent motion, and appearance changes. We introduce 3D pooling, cost volume processing, and warped feature differences, respectively for extracting the three components above. These modules are incorporated as three branches in our unified network, which share the underlying features and are learned jointly in an end-to-end manner. On two large datasets, UCF101 [22] and Kinetics [16], our method obtained competitive performances with high efficiency, using only the RGB frame sequence as input.","PeriodicalId":6564,"journal":{"name":"2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition","volume":"27 1","pages":"6566-6575"},"PeriodicalIF":0.0000,"publicationDate":"2018-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"60","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CVPR.2018.00687","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 60
Abstract
Despite the remarkable progress in action recognition over the past several years, existing methods remain limited in efficiency and effectiveness. Methods that treat appearance and motion as separate streams usually incur the cost of optical flow computation, while those relying on 3D convolution over the original video frames often yield inferior performance in practice. In this paper, we propose a new ConvNet architecture for video representation learning, which can derive disentangled components of dynamics purely from raw video frames, without the need for optical flow estimation. Specifically, the learned representation comprises three components representing static appearance, apparent motion, and appearance changes. We introduce 3D pooling, cost volume processing, and warped feature differences to extract these three components, respectively. These modules are incorporated as three branches in our unified network, which share the underlying features and are learned jointly in an end-to-end manner. On two large datasets, UCF101 [22] and Kinetics [16], our method achieves competitive performance with high efficiency, using only the RGB frame sequence as input.
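To make the three-branch idea from the abstract concrete, below is a minimal PyTorch sketch: shared per-frame features feed (1) a static-appearance branch via 3D (temporal + spatial) pooling, (2) an apparent-motion branch via a correlation cost volume over adjacent frames, and (3) an appearance-change branch via warped feature differences. This is an illustration under assumptions, not the authors' implementation: the tiny backbone, channel sizes, the soft-argmax used to turn the cost volume into a displacement field for warping, and all names (`DisentangledDynamics`, `cost_volume`, `motion_head`) are hypothetical stand-ins.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def cost_volume(f1, f2, max_disp=4):
    """Correlation cost volume between feature maps f1, f2 of shape (B, C, H, W).

    Returns (B, (2*max_disp + 1)**2, H, W): one matching score per candidate
    displacement of f2 relative to f1, ordered (dy outer, dx inner).
    """
    B, C, H, W = f1.shape
    f2p = F.pad(f2, [max_disp] * 4)  # pad left/right/top/bottom
    d = 2 * max_disp + 1
    vols = []
    for dy in range(d):
        for dx in range(d):
            shifted = f2p[:, :, dy:dy + H, dx:dx + W]
            vols.append((f1 * shifted).mean(dim=1, keepdim=True))
    return torch.cat(vols, dim=1)


def soft_argmax_flow(cv, max_disp):
    """Expected displacement field (B, 2, H, W) via softmax over the cost volume.

    This soft-argmax readout is an assumption of this sketch, not the paper's.
    """
    d = 2 * max_disp + 1
    probs = F.softmax(cv, dim=1)
    offs = torch.arange(d, device=cv.device, dtype=cv.dtype) - max_disp
    dy = offs.repeat_interleave(d).view(1, d * d, 1, 1)  # matches (dy, dx) order
    dx = offs.repeat(d).view(1, d * d, 1, 1)
    flow_x = (probs * dx).sum(dim=1, keepdim=True)
    flow_y = (probs * dy).sum(dim=1, keepdim=True)
    return torch.cat([flow_x, flow_y], dim=1)  # (x, y) order


def warp(feat, flow):
    """Backward-warp feat (B, C, H, W) by flow (B, 2, H, W) using grid_sample."""
    B, _, H, W = feat.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    base = torch.stack([xs, ys], dim=0).to(feat).unsqueeze(0)  # (1, 2, H, W)
    grid = base + flow
    # Normalize pixel coordinates to [-1, 1] as grid_sample expects.
    gx = 2.0 * grid[:, 0] / max(W - 1, 1) - 1.0
    gy = 2.0 * grid[:, 1] / max(H - 1, 1) - 1.0
    return F.grid_sample(feat, torch.stack([gx, gy], dim=-1), align_corners=True)


class DisentangledDynamics(nn.Module):
    """Three branches over shared per-frame features: static appearance
    (temporal + spatial pooling), apparent motion (cost volume), and
    appearance change (difference of warped adjacent features)."""

    def __init__(self, feat_ch=64, num_classes=101, max_disp=4):
        super().__init__()
        self.max_disp = max_disp
        # Shared backbone: a stand-in for the deeper 2D CNN a real model would use.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, feat_ch, 7, stride=4, padding=3), nn.ReLU(),
            nn.Conv2d(feat_ch, feat_ch, 3, stride=2, padding=1), nn.ReLU())
        self.motion_head = nn.Conv2d((2 * max_disp + 1) ** 2, feat_ch, 1)
        self.classifier = nn.Linear(3 * feat_ch, num_classes)

    def forward(self, clip):  # clip: (B, T, 3, H, W) raw RGB frames
        B, T = clip.shape[:2]
        feats = self.backbone(clip.flatten(0, 1))      # (B*T, C, h, w)
        feats = feats.reshape(B, T, *feats.shape[1:])  # (B, T, C, h, w)

        # Branch 1: static appearance via 3D (temporal + spatial) pooling.
        static = feats.mean(dim=(1, 3, 4))             # (B, C)

        # Branches 2 and 3, computed per adjacent frame pair.
        motion, change = [], []
        for t in range(T - 1):
            f1, f2 = feats[:, t], feats[:, t + 1]
            cv = cost_volume(f1, f2, self.max_disp)
            # Branch 2: apparent motion read off the cost volume.
            motion.append(self.motion_head(cv).mean(dim=(2, 3)))
            # Branch 3: appearance change = warped feature difference.
            flow = soft_argmax_flow(cv, self.max_disp)
            change.append((warp(f2, flow) - f1).mean(dim=(2, 3)))
        motion = torch.stack(motion, dim=1).mean(dim=1)
        change = torch.stack(change, dim=1).mean(dim=1)
        return self.classifier(torch.cat([static, motion, change], dim=1))


if __name__ == "__main__":
    model = DisentangledDynamics()
    logits = model(torch.randn(2, 8, 3, 64, 64))  # 2 clips of 8 RGB frames
    print(logits.shape)  # torch.Size([2, 101])
```

Note the design point the abstract emphasizes: all three branches consume the same shared features computed once per frame, so motion and appearance-change cues come at a small marginal cost compared with running a separate optical-flow pipeline.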