{"title":"Semi-Supervised DFF: Decoupling Detection and Feature Flow for Video Object Detectors","authors":"Guangxing Han, Xuan Zhang, Chongrong Li","doi":"10.1145/3240508.3240693","DOIUrl":null,"url":null,"abstract":"For efficient video object detection, our detector consists of a spatial module and a temporal module. The spatial module aims to detect objects in static frames using convolutional networks, and the temporal module propagates high-level CNN features to nearby frames via light-weight feature flow. Alternating the spatial and temporal module by a proper interval makes our detector fast and accurate. Then we propose a two-stage semi-supervised learning framework to train our detector, which fully exploits unlabeled videos by decoupling the spatial and temporal module. In the first stage, the spatial module is learned by traditional supervised learning. In the second stage, we employ both feature regression loss and feature semantic loss to learn our temporal module via unsupervised learning. Different to traditional methods, our method can largely exploit unlabeled videos and bridges the gap of object detectors in image and video domain. Experiments on the large-scale ImageNet VID dataset demonstrate the effectiveness of our method. Code will be made publicly available.","PeriodicalId":339857,"journal":{"name":"Proceedings of the 26th ACM international conference on Multimedia","volume":"27 3 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"14","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 26th ACM international conference on Multimedia","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3240508.3240693","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 14
Abstract
For efficient video object detection, our detector consists of a spatial module and a temporal module. The spatial module aims to detect objects in static frames using convolutional networks, and the temporal module propagates high-level CNN features to nearby frames via light-weight feature flow. Alternating the spatial and temporal module by a proper interval makes our detector fast and accurate. Then we propose a two-stage semi-supervised learning framework to train our detector, which fully exploits unlabeled videos by decoupling the spatial and temporal module. In the first stage, the spatial module is learned by traditional supervised learning. In the second stage, we employ both feature regression loss and feature semantic loss to learn our temporal module via unsupervised learning. Different to traditional methods, our method can largely exploit unlabeled videos and bridges the gap of object detectors in image and video domain. Experiments on the large-scale ImageNet VID dataset demonstrate the effectiveness of our method. Code will be made publicly available.