Attentional Neural Fields for Crowd Counting
Anran Zhang, Lei Yue, Jiayi Shen, Fan Zhu, Xiantong Zhen, Xianbin Cao, Ling Shao
2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 5713-5722. DOI: 10.1109/ICCV.2019.00581
Abstract: Crowd counting has recently attracted great interest in computer vision and is extremely challenging due to the huge scale variations of objects. In this paper, we propose the Attentional Neural Field (ANF) for crowd counting via density estimation. Within an encoder-decoder network, we introduce conditional random fields (CRFs) to aggregate multi-scale features, which builds more informative representations. To better model pair-wise potentials in the CRFs, we incorporate a non-local attention mechanism, implemented as inter- and intra-layer attention, which expands the receptive field to the entire image both within the same layer and across different layers, capturing the long-range dependencies needed to handle huge scale variations. The CRFs coupled with the attention mechanism are seamlessly integrated into the encoder-decoder network, establishing an ANF that can be optimized end-to-end by back-propagation. We conduct extensive experiments on four public datasets: ShanghaiTech, WorldExpo'10, UCF-CC-50, and UCF-QNRF. The results show that our ANF achieves high counting performance, surpassing most previous methods.
Enriched Feature Guided Refinement Network for Object Detection
Jing Nie, R. Anwer, Hisham Cholakkal, F. Khan, Yanwei Pang, Ling Shao
2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 9536-9545. DOI: 10.1109/ICCV.2019.00963
Abstract: We propose a single-stage detection framework that jointly tackles the problems of multi-scale object detection and class imbalance. Rather than designing deeper networks, we introduce a simple yet effective feature enrichment scheme to produce multi-scale contextual features. We further introduce a cascaded refinement scheme which first instills the multi-scale contextual features into the prediction layers of the single-stage detector to enrich their discriminative power for multi-scale detection. Second, the cascaded refinement scheme counters the class-imbalance problem by refining the anchors and enriched features to improve classification and regression. Experiments are performed on two benchmarks: PASCAL VOC and MS COCO. For a 320×320 input on MS COCO test-dev, our detector achieves state-of-the-art single-stage detection accuracy with a COCO AP of 33.2 under single-scale inference, while operating at 21 milliseconds on a Titan XP GPU. For a 512×512 input on MS COCO test-dev, our approach obtains an absolute gain of 1.6% in COCO AP compared to the best previously reported single-stage results [5]. Source code and models are available at: https://github.com/Ranchentx/EFGRNet.
Context-Aware Feature and Label Fusion for Facial Action Unit Intensity Estimation With Partially Labeled Data
Yong Zhang, Haiyong Jiang, Baoyuan Wu, Yanbo Fan, Q. Ji
2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 733-742. DOI: 10.1109/ICCV.2019.00082
Abstract: Facial action unit (AU) intensity estimation is a fundamental task in facial behaviour analysis. Most previous methods use the whole face image as input for intensity prediction. Since AUs are defined by their corresponding local appearance, a few patch-based methods utilize image features of local patches; however, fusion of the local features is usually performed via straightforward feature concatenation or summation. Moreover, these methods require fully annotated databases for model learning, which are expensive to acquire. In this paper, we propose a novel weakly supervised patch-based deep model built on two types of attention mechanisms for joint intensity estimation of multiple AUs. The model consists of a feature fusion module and a label fusion module, and we augment the attention mechanisms of both modules with a learnable task-related context, since a patch may play different roles when analyzing different AUs and each AU has its own temporal evolution rule. The context-aware feature fusion module captures spatial relationships among local patches, while the context-aware label fusion module captures the temporal dynamics of AUs. The latter enables the model to be trained on a partially annotated database. Experimental evaluations on two benchmark expression databases demonstrate the superior performance of the proposed method.
Dynamic PET Image Reconstruction Using Nonnegative Matrix Factorization Incorporated With Deep Image Prior
Tatsuya Yokota, Kazuya Kawai, M. Sakata, Y. Kimura, H. Hontani
2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 3126-3135. DOI: 10.1109/ICCV.2019.00322
Abstract: We propose a method that reconstructs dynamic positron emission tomography (PET) images from given sinograms by using non-negative matrix factorization (NMF) incorporated with a deep image prior (DIP) that appropriately constrains the spatial patterns of the resultant images. The proposed method can reconstruct dynamic PET images with a higher signal-to-noise ratio (SNR) and blindly decompose an image matrix into pairs of spatial and temporal factors. The former represent homogeneous tissues with different kinetic parameters, and the latter represent the time-activity curves observed in the corresponding homogeneous tissues. We employ U-Nets combined in parallel for the DIP, with each U-Net extracting one spatial factor decomposed from the data matrix. Experimental results show that the proposed method outperforms conventional methods and can extract spatial factors that represent the homogeneous tissues.
{"title":"Face Video Deblurring Using 3D Facial Priors","authors":"Wenqi Ren, Jiaolong Yang, Senyou Deng, D. Wipf, Xiaochun Cao, Xin Tong","doi":"10.1109/ICCV.2019.00948","DOIUrl":"https://doi.org/10.1109/ICCV.2019.00948","url":null,"abstract":"Existing face deblurring methods only consider single frames and do not account for facial structure and identity information. These methods struggle to deblur face videos that exhibit significant pose variations and misalignment. In this paper we propose a novel face video deblurring network capitalizing on 3D facial priors. The model consists of two main branches: i) a face video deblurring sub-network based on an encoder-decoder architecture, and ii) a 3D face reconstruction and rendering branch for predicting 3D priors of salient facial structures and identity knowledge. These structures encourage the deblurring branch to generate sharp faces with detailed structures. Our method not only uses low-level information (i.e., image intensity), but also middle-level information (i.e., 3D facial structure) and high-level knowledge (i.e., identity content) to further explore spatial constraints of facial components from blurry face frames. Extensive experimental results demonstrate that the proposed algorithm performs favorably against the state-of-the-art methods.","PeriodicalId":6728,"journal":{"name":"2019 IEEE/CVF International Conference on Computer Vision (ICCV)","volume":"30 1","pages":"9387-9396"},"PeriodicalIF":0.0,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73214623","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Distillation-Based Training for Multi-Exit Architectures","authors":"Mary Phuong, Christoph H. Lampert","doi":"10.1109/ICCV.2019.00144","DOIUrl":"https://doi.org/10.1109/ICCV.2019.00144","url":null,"abstract":"Multi-exit architectures, in which a stack of processing layers is interleaved with early output layers, allow the processing of a test example to stop early and thus save computation time and/or energy. In this work, we propose a new training procedure for multi-exit architectures based on the principle of knowledge distillation. The method encourages early exits to mimic later, more accurate exits, by matching their probability outputs. Experiments on CIFAR100 and ImageNet show that distillation-based training significantly improves the accuracy of early exits while maintaining state-of-the-art accuracy for late ones. The method is particularly beneficial when training data is limited and also allows a straight-forward extension to semi-supervised learning, i.e. make use also of unlabeled data at training time. Moreover, it takes only a few lines to implement and imposes almost no computational overhead at training time, and none at all at test time.","PeriodicalId":6728,"journal":{"name":"2019 IEEE/CVF International Conference on Computer Vision (ICCV)","volume":"1 1","pages":"1355-1364"},"PeriodicalIF":0.0,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75313299","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multi-Class Part Parsing With Joint Boundary-Semantic Awareness","authors":"Yifan Zhao, Jia Li, Yu Zhang, Yonghong Tian","doi":"10.1109/ICCV.2019.00927","DOIUrl":"https://doi.org/10.1109/ICCV.2019.00927","url":null,"abstract":"Object part parsing in the wild, which requires to simultaneously detect multiple object classes in the scene and accurately segments semantic parts within each class, is challenging for the joint presence of class-level and part-level ambiguities. Despite its importance, however, this problem is not sufficiently explored in existing works. In this paper, we propose a joint parsing framework with boundary and semantic awareness to address this challenging problem. To handle part-level ambiguity, a boundary awareness module is proposed to make mid-level features at multiple scales attend to part boundaries for accurate part localization, which are then fused with high-level features for effective part recognition. For class-level ambiguity, we further present a semantic awareness module that selects discriminative part features relevant to a category to prevent irrelevant features being merged together. The proposed modules are lightweight and implementation friendly, improving the performance substantially when plugged into various baseline architectures. Without bells and whistles, the full model sets new state-of-the-art results on the Pascal-Part dataset, in both multi-class and the conventional single-class setting, while running substantially faster than recent high-performance approaches.","PeriodicalId":6728,"journal":{"name":"2019 IEEE/CVF International Conference on Computer Vision (ICCV)","volume":"2 1","pages":"9176-9185"},"PeriodicalIF":0.0,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75479109","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Elaborate Monocular Point and Line SLAM With Robust Initialization","authors":"Sang Jun Lee, S. Hwang","doi":"10.1109/ICCV.2019.00121","DOIUrl":"https://doi.org/10.1109/ICCV.2019.00121","url":null,"abstract":"This paper presents a monocular indirect SLAM system which performs robust initialization and accurate localization. For initialization, we utilize a matrix factorization-based method. Matrix factorization-based methods require that extracted feature points must be tracked in all used frames. Since consistent tracking is difficult in challenging environments, a geometric interpolation that utilizes epipolar geometry is proposed. For localization, 3D lines are utilized. We propose the use of Plu ̈cker line coordinates to represent geometric information of lines. We also propose orthonormal representation of Plu ̈cker line coordinates and Jacobians of lines for better optimization. Experimental results show that the proposed initialization generates consistent and robust map in linear time with fast convergence even in challenging scenes. And localization using proposed line representations is faster, more accurate and memory efficient than other state-of-the-art methods.","PeriodicalId":6728,"journal":{"name":"2019 IEEE/CVF International Conference on Computer Vision (ICCV)","volume":"56 1","pages":"1121-1129"},"PeriodicalIF":0.0,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75098151","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Align, Attend and Locate: Chest X-Ray Diagnosis via Contrast Induced Attention Network With Limited Supervision","authors":"Jingyun Liu, Gangming Zhao, Yu Fei, Ming Zhang, Yizhou Wang, Yizhou Yu","doi":"10.1109/ICCV.2019.01073","DOIUrl":"https://doi.org/10.1109/ICCV.2019.01073","url":null,"abstract":"Obstacles facing accurate identification and localization of diseases in chest X-ray images lie in the lack of high-quality images and annotations. In this paper, we propose a Contrast Induced Attention Network (CIA-Net), which exploits the highly structured property of chest X-ray images and localizes diseases via contrastive learning on the aligned positive and negative samples. To force the attention module to focus only on sites of abnormalities, we also introduce a learnable alignment module to adjust all the input images, which eliminates variations of scales, angles, and displacements of X-ray images generated under bad scan conditions. We show that the use of contrastive attention and alignment module allows the model to learn rich identification and localization information using only a small amount of location annotations, resulting in state-of-the-art performance in NIH chest X-ray dataset.","PeriodicalId":6728,"journal":{"name":"2019 IEEE/CVF International Conference on Computer Vision (ICCV)","volume":"177 1","pages":"10631-10640"},"PeriodicalIF":0.0,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78031119","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Adaptive Density Map Generation for Crowd Counting","authors":"Jia Wan, Antoni B. Chan","doi":"10.1109/ICCV.2019.00122","DOIUrl":"https://doi.org/10.1109/ICCV.2019.00122","url":null,"abstract":"Crowd counting is an important topic in computer vision due to its practical usage in surveillance systems. The typical design of crowd counting algorithms is divided into two steps. First, the ground-truth density maps of crowd images are generated from the ground-truth dot maps (density map generation), e.g., by convolving with a Gaussian kernel. Second, deep learning models are designed to predict a density map from an input image (density map estimation). Most research efforts have concentrated on the density map estimation problem, while the problem of density map generation has not been adequately explored. In particular, the density map could be considered as an intermediate representation used to train a crowd counting network. In the sense of end-to-end training, the hand-crafted methods used for generating the density maps may not be optimal for the particular network or dataset used. To address this issue, we first show the impact of different density maps and that better ground-truth density maps can be obtained by refining the existing ones using a learned refinement network, which is jointly trained with the counter. Then, we propose an adaptive density map generator, which takes the annotation dot map as input, and learns a density map representation for a counter. The counter and generator are trained jointly within an end-to-end framework. The experiment results on popular counting datasets confirm the effectiveness of the proposed learnable density map representations.","PeriodicalId":6728,"journal":{"name":"2019 IEEE/CVF International Conference on Computer Vision (ICCV)","volume":"167 1","pages":"1130-1139"},"PeriodicalIF":0.0,"publicationDate":"2019-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80492265","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}