{"title":"Video Understanding via Convolutional Temporal Pooling Network and Multimodal Feature Fusion","authors":"Heeseung Kwon, Suha Kwak, Minsu Cho","doi":"10.1145/3265987.3265991","DOIUrl":null,"url":null,"abstract":"In this paper, we present a new end-to-end convolutional neural network architecture for video classification, and apply the model to action and scene recognition in untrimmed videos for the Challenge on Comprehensive Video Understanding in the Wild. The proposed architecture takes densely sampled video frames as inputs, and apply a temporal pooling operator inside the network to capture temporal context of the input video. As a result, our architecture outputs distinct video-level features with a set of different temporal pooling operators. Furthermore, we design a multimodal feature fusion model by concatenating our video-level features with those given in the challenge dataset. Experimental results on the challenge dataset demonstrate that the proposed architecture and the multimodal feature fusion approach together achieve outstanding performance in action and scene recognition.","PeriodicalId":151362,"journal":{"name":"Proceedings of the 1st Workshop and Challenge on Comprehensive Video Understanding in the Wild","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2018-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 1st Workshop and Challenge on Comprehensive Video Understanding in the Wild","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3265987.3265991","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 3
Abstract
In this paper, we present a new end-to-end convolutional neural network architecture for video classification, and apply the model to action and scene recognition in untrimmed videos for the Challenge on Comprehensive Video Understanding in the Wild. The proposed architecture takes densely sampled video frames as inputs and applies temporal pooling operators inside the network to capture the temporal context of the input video. As a result, our architecture outputs distinct video-level features, one for each of a set of different temporal pooling operators. Furthermore, we design a multimodal feature fusion model by concatenating our video-level features with those provided in the challenge dataset. Experimental results on the challenge dataset demonstrate that the proposed architecture and the multimodal feature fusion approach together achieve outstanding performance in action and scene recognition.
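The abstract summarizes the architecture without implementation details, so the following is a minimal PyTorch-style sketch of the two ideas, assuming a shared 2D CNN backbone that maps each frame to a feature vector. The names (TemporalPoolingNet, fuse_multimodal), the choice of max and average as the temporal pooling operators, and the way per-operator predictions are combined are all illustrative assumptions, not the authors' exact design.

    import torch
    import torch.nn as nn

    class TemporalPoolingNet(nn.Module):
        """Sketch of a convolutional temporal pooling network: per-frame
        features from a shared 2D CNN are pooled over the time axis with
        several operators, each yielding a distinct video-level feature."""

        def __init__(self, backbone: nn.Module, feat_dim: int, num_classes: int):
            super().__init__()
            self.backbone = backbone  # assumed: maps one frame to a feat_dim vector
            self.fc_max = nn.Linear(feat_dim, num_classes)  # head for max-pooled feature
            self.fc_avg = nn.Linear(feat_dim, num_classes)  # head for average-pooled feature

        def forward(self, frames: torch.Tensor) -> torch.Tensor:
            # frames: (batch, time, channels, height, width), densely sampled
            b, t = frames.shape[:2]
            x = self.backbone(frames.flatten(0, 1))  # (b * t, feat_dim)
            x = x.view(b, t, -1)                     # (b, t, feat_dim)
            # temporal pooling inside the network collapses the time axis;
            # each operator produces a distinct video-level feature
            feat_max = x.max(dim=1).values
            feat_avg = x.mean(dim=1)
            # one simple way to combine the per-operator predictions
            return 0.5 * (self.fc_max(feat_max) + self.fc_avg(feat_avg))

    def fuse_multimodal(video_feat: torch.Tensor,
                        dataset_feats: list[torch.Tensor]) -> torch.Tensor:
        # Multimodal fusion by concatenation: join our video-level feature
        # with the features provided in the challenge dataset, then feed
        # the result to a joint classifier (not shown).
        return torch.cat([video_feat, *dataset_feats], dim=-1)

On this reading, the multimodal fusion described in the abstract is a simple late fusion: the learned video-level features and the features shipped with the challenge dataset are concatenated along the feature dimension before classification.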