{"title":"Learning Selective Self-Mutual Attention for RGB-D Saliency Detection","authors":"Nian Liu, Ni Zhang, Junwei Han","doi":"10.1109/cvpr42600.2020.01377","DOIUrl":"https://doi.org/10.1109/cvpr42600.2020.01377","url":null,"abstract":"Saliency detection on RGB-D images is receiving more and more research interests recently. Previous models adopt the early fusion or the result fusion scheme to fuse the input RGB and depth data or their saliency maps, which incur the problem of distribution gap or information loss. Some other models use the feature fusion scheme but are limited by the linear feature fusion methods. In this paper, we propose to fuse attention learned in both modalities. Inspired by the Non-local model, we integrate the self-attention and each other's attention to propagate long-range contextual dependencies, thus incorporating multi-modal information to learn attention and propagate contexts more accurately. Considering the reliability of the other modality's attention, we further propose a selection attention to weight the newly added attention term. We embed the proposed attention module in a two-stream CNN for RGB-D saliency detection. Furthermore, we also propose a residual fusion module to fuse the depth decoder features into the RGB stream. Experimental results on seven benchmark datasets demonstrate the effectiveness of the proposed model components and our final saliency model. Our code and saliency maps are available at https://github.com/nnizhang/S2MA.","PeriodicalId":6715,"journal":{"name":"2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"133 1","pages":"13753-13762"},"PeriodicalIF":0.0,"publicationDate":"2020-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75762155","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Muzammal Naseer, Salman Hameed Khan, Munawar Hayat, F. Khan, F. Porikli
{"title":"A Self-supervised Approach for Adversarial Robustness","authors":"Muzammal Naseer, Salman Hameed Khan, Munawar Hayat, F. Khan, F. Porikli","doi":"10.1109/cvpr42600.2020.00034","DOIUrl":"https://doi.org/10.1109/cvpr42600.2020.00034","url":null,"abstract":"Adversarial examples can cause catastrophic mistakes in Deep Neural Network (DNNs) based vision systems e.g., for classification, segmentation and object detection. The vulnerability of DNNs against such attacks can prove a major roadblock towards their real-world deployment. Transferability of adversarial examples demand generalizable defenses that can provide cross-task protection. Adversarial training that enhances robustness by modifying target model's parameters lacks such generalizability. On the other hand, different input processing based defenses fall short in the face of continuously evolving attacks. In this paper, we take the first step to combine the benefits of both approaches and propose a self-supervised adversarial training mechanism in the input space. By design, our defense is a generalizable approach and provides significant robustness against the textbf{unseen} adversarial attacks (eg by reducing the success rate of translation-invariant textbf{ensemble} attack from 82.6% to 31.9% in comparison to previous state-of-the-art). It can be deployed as a plug-and-play solution to protect a variety of vision systems, as we demonstrate for the case of classification, segmentation and detection.","PeriodicalId":6715,"journal":{"name":"2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"46 1","pages":"259-268"},"PeriodicalIF":0.0,"publicationDate":"2020-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74450315","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Akshay Dudhane, K. Biradar, Prashant W. Patil, Praful Hambarde, S. Murala
{"title":"Varicolored Image De-Hazing","authors":"Akshay Dudhane, K. Biradar, Prashant W. Patil, Praful Hambarde, S. Murala","doi":"10.1109/CVPR42600.2020.00462","DOIUrl":"https://doi.org/10.1109/CVPR42600.2020.00462","url":null,"abstract":"The quality of images captured in bad weather is often affected by chromatic casts and low visibility due to the presence of atmospheric particles. Restoration of the color balance is often ignored in most of the existing image de-hazing methods. In this paper, we propose a varicolored end-to-end image de-hazing network which restores the color balance in a given varicolored hazy image and recovers the haze-free image. The proposed network comprises of 1) Haze color correction (HCC) module and 2) Visibility improvement (VI) module. The proposed HCC module provides required attention to each color channel and generates a color balanced hazy image. While the proposed VI module processes the color balanced hazy image through novel inception attention block to recover the haze-free image. We also propose a novel approach to generate a large-scale varicolored synthetic hazy image database. An ablation study has been carried out to demonstrate the effect of different factors on the performance of the proposed network for image de-hazing. Three benchmark synthetic datasets have been used for quantitative analysis of the proposed network. Visual results on a set of real-world hazy images captured in different weather conditions demonstrate the effectiveness of the proposed approach for varicolored image de-hazing.","PeriodicalId":6715,"journal":{"name":"2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"20 1","pages":"4563-4572"},"PeriodicalIF":0.0,"publicationDate":"2020-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74754291","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ziyu Jiang, Buyu Liu, S. Schulter, Zhangyang Wang, Manmohan Chandraker
{"title":"Peek-a-Boo: Occlusion Reasoning in Indoor Scenes With Plane Representations","authors":"Ziyu Jiang, Buyu Liu, S. Schulter, Zhangyang Wang, Manmohan Chandraker","doi":"10.1109/cvpr42600.2020.00019","DOIUrl":"https://doi.org/10.1109/cvpr42600.2020.00019","url":null,"abstract":"We address the challenging task of occlusion-aware indoor 3D scene understanding. We represent scenes by a set of planes, where each one is defined by its normal, offset and two masks outlining (i) the extent of the visible part and (ii) the full region that consists of both visible and occluded parts of the plane. We infer these planes from a single input image with a novel neural network architecture. It consists of a two-branch category-specific module that aims to predict layout and objects of the scene separately so that different types of planes can be handled better. We also introduce a novel loss function based on plane warping that can leverage multiple views at training time for improved occlusion-aware reasoning. In order to train and evaluate our occlusion-reasoning model, we use the ScanNet dataset and propose (i) a strategy to automatically extract ground truth for both visible and hidden regions and (ii) a new evaluation metric that specifically focuses on the prediction in hidden regions. We empirically demonstrate that our proposed approach can achieve higher accuracy for occlusion reasoning compared to competitive baselines on the ScanNet dataset, e.g. 42.65% relative improvement on hidden regions.","PeriodicalId":6715,"journal":{"name":"2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"26 1","pages":"110-118"},"PeriodicalIF":0.0,"publicationDate":"2020-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73246375","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Toward a Universal Model for Shape From Texture","authors":"Dor Verbin, Todd E. Zickler","doi":"10.1109/cvpr42600.2020.00050","DOIUrl":"https://doi.org/10.1109/cvpr42600.2020.00050","url":null,"abstract":"We consider the shape from texture problem, where the input is a single image of a curved, textured surface, and the texture and shape are both a priori unknown. We formulate this task as a three-player game between a shape process, a texture process, and a discriminator. The discriminator adapts a set of non-linear filters to try to distinguish image patches created by the texture process from those created by the shape process, while the shape and texture processes try to create image patches that are indistinguishable from those of the other. An equilibrium of this game yields two things: an estimate of the 2.5D surface from the shape process, and a stochastic texture synthesis model from the texture process. Experiments show that this approach is robust to common non-idealities such as shading, gloss, and clutter. We also find that it succeeds for a wide variety of texture types, including both periodic textures and those composed of isolated textons, which have previously required distinct and specialized processing.","PeriodicalId":6715,"journal":{"name":"2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"6 1","pages":"419-427"},"PeriodicalIF":0.0,"publicationDate":"2020-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74399514","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MemNAS: Memory-Efficient Neural Architecture Search With Grow-Trim Learning","authors":"Peiye Liu, Bo Wu, Huadong Ma, Mingoo Seok","doi":"10.1109/cvpr42600.2020.00218","DOIUrl":"https://doi.org/10.1109/cvpr42600.2020.00218","url":null,"abstract":"Recent studies on automatic neural architecture search techniques have demonstrated significant performance, competitive to or even better than hand-crafted neural architectures. However, most of the existing search approaches tend to use residual structures and a concatenation connection between shallow and deep features. A resulted neural network model, therefore, is non-trivial for resource-constraint devices to execute since such a model requires large memory to store network parameters and intermediate feature maps along with excessive computing complexity. To address this challenge, we propose MemNAS, a novel growing and trimming based neural architecture search framework that optimizes not only performance but also memory requirement of an inference network. Specifically, in the search process, we consider running memory use, including network parameters and the essential intermediate feature maps memory requirement, as an optimization objective along with performance. Besides, to improve the accuracy of the search, we extract the correlation information among multiple candidate architectures to rank them and then choose the candidates with desired performance and memory efficiency. On the ImageNet classification task, our MemNAS achieves 75.4% accuracy, 0.7% higher than MobileNetV2 with 42.1% less memory requirement. Additional experiments confirm that the proposed MemNAS can perform well across the different targets of the trade-off between accuracy and memory consumption.","PeriodicalId":6715,"journal":{"name":"2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"27 1","pages":"2105-2113"},"PeriodicalIF":0.0,"publicationDate":"2020-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77492121","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"What Machines See Is Not What They Get: Fooling Scene Text Recognition Models With Adversarial Text Images","authors":"Xing Xu, Jiefu Chen, Jinhui Xiao, Lianli Gao, Fumin Shen, Heng Tao Shen","doi":"10.1109/cvpr42600.2020.01232","DOIUrl":"https://doi.org/10.1109/cvpr42600.2020.01232","url":null,"abstract":"The research on scene text recognition (STR) has made remarkable progress in recent years with the development of deep neural networks (DNNs). Recent studies on adversarial attack have verified that a DNN model designed for non-sequential tasks (e.g., classification, segmentation and retrieval) can be easily fooled by adversarial examples. Actually, STR is an application highly related to security issues. However, there are few studies considering the safety and reliability of STR models that make sequential prediction. In this paper, we make the first attempt in attacking the state-of-the-art DNN-based STR models. Specifically, we propose a novel and efficient optimization-based method that can be naturally integrated to different sequential prediction schemes, i.e., connectionist temporal classification (CTC) and attention mechanism. We apply our proposed method to five state-of-the-art STR models with both targeted and untargeted attack modes, the comprehensive results on 7 real-world datasets and 2 synthetic datasets consistently show the vulnerability of these STR models with a significant performance drop. Finally, we also test our attack method on a real-world STR engine of Baidu OCR, which demonstrates the practical potentials of our method.","PeriodicalId":6715,"journal":{"name":"2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"39 1","pages":"12301-12311"},"PeriodicalIF":0.0,"publicationDate":"2020-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80519017","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dynamic Fluid Surface Reconstruction Using Deep Neural Network","authors":"Simron Thapa, Nianyi Li, Jinwei Ye","doi":"10.1109/cvpr42600.2020.00010","DOIUrl":"https://doi.org/10.1109/cvpr42600.2020.00010","url":null,"abstract":"Recovering the dynamic fluid surface is a long-standing challenging problem in computer vision. Most existing image-based methods require multiple views or a dedicated imaging system. Here we present a learning-based single-image approach for 3D fluid surface reconstruction. Specifically, we design a deep neural network that estimates the depth and normal maps of a fluid surface by analyzing the refractive distortion of a reference background image. Due to the dynamic nature of fluid surfaces, our network uses recurrent layers that carry temporal information from previous frames to achieve spatio-temporally consistent reconstruction given a video input. Due to the lack of fluid data, we synthesize a large fluid dataset using physics-based fluid modeling and rendering techniques for network training and validation. Through experiments on simulated and real captured fluid images, we demonstrate that our proposed deep neural network trained on our fluid dataset can recover dynamic 3D fluid surfaces with high accuracy.","PeriodicalId":6715,"journal":{"name":"2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"28 1","pages":"21-30"},"PeriodicalIF":0.0,"publicationDate":"2020-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84222236","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Kanishka Rao, Chris Harris, A. Irpan, S. Levine, Julian Ibarz, Mohi Khansari
{"title":"RL-CycleGAN: Reinforcement Learning Aware Simulation-to-Real","authors":"Kanishka Rao, Chris Harris, A. Irpan, S. Levine, Julian Ibarz, Mohi Khansari","doi":"10.1109/CVPR42600.2020.01117","DOIUrl":"https://doi.org/10.1109/CVPR42600.2020.01117","url":null,"abstract":"Deep neural network based reinforcement learning (RL) can learn appropriate visual representations for complex tasks like vision-based robotic grasping without the need for manually engineering or prior learning a perception system. However, data for RL is collected via running an agent in the desired environment, and for applications like robotics, running a robot in the real world may be extremely costly and time consuming. Simulated training offers an appealing alternative, but ensuring that policies trained in simulation can transfer effectively into the real world requires additional machinery. Simulations may not match reality, and typically bridging the simulation-to-reality gap requires domain knowledge and task-specific engineering. We can automate this process by employing generative models to translate simulated images into realistic ones. However, this sort of translation is typically task-agnostic, in that the translated images may not preserve all features that are relevant to the task. In this paper, we introduce the RL-scene consistency loss for image translation, which ensures that the translation operation is invariant with respect to the Q-values associated with the image. This allows us to learn a task-aware translation. Incorporating this loss into unsupervised domain translation, we obtain the RL-CycleGAN, a new approach for simulation-to-real-world transfer for reinforcement learning. In evaluations of RL-CycleGAN on two vision-based robotics grasping tasks, we show that RL-CycleGAN offers a substantial improvement over a number of prior methods for sim-to-real transfer, attaining excellent real-world performance with only a modest number of real-world observations.","PeriodicalId":6715,"journal":{"name":"2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"3 1","pages":"11154-11163"},"PeriodicalIF":0.0,"publicationDate":"2020-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85819159","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Weifeng Chen, Shengyi Qian, David Fan, Noriyuki Kojima, Max Hamilton, Jia Deng
{"title":"OASIS: A Large-Scale Dataset for Single Image 3D in the Wild","authors":"Weifeng Chen, Shengyi Qian, David Fan, Noriyuki Kojima, Max Hamilton, Jia Deng","doi":"10.1109/cvpr42600.2020.00076","DOIUrl":"https://doi.org/10.1109/cvpr42600.2020.00076","url":null,"abstract":"Single-view 3D is the task of recovering 3D properties such as depth and surface normals from a single image. We hypothesize that a major obstacle to single-image 3D is data. We address this issue by presenting Open Annotations of Single Image Surfaces (OASIS), a dataset for single-image 3D in the wild consisting of annotations of detailed 3D geometry for 140,000 images. We train and evaluate leading models on a variety of single-image 3D tasks. We expect OASIS to be a useful resource for 3D vision research. Project site: https://pvl.cs.princeton.edu/OASIS.","PeriodicalId":6715,"journal":{"name":"2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"26 1","pages":"676-685"},"PeriodicalIF":0.0,"publicationDate":"2020-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78369066","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}