Title: Channel Attention Based Iterative Residual Learning for Depth Map Super-Resolution
Authors: Xibin Song, Yuchao Dai, Dingfu Zhou, Liu Liu, Wei Li, H. Li, Ruigang Yang
Venue: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5630-5639
DOI: https://doi.org/10.1109/cvpr42600.2020.00567
Abstract: Despite the remarkable progress made in deep-learning-based depth map super-resolution (DSR), tackling real-world degradation in low-resolution (LR) depth maps remains a major challenge. Existing DSR models are generally trained and tested on synthetic datasets, which differ substantially from what a real depth sensor produces. In this paper, we argue that DSR models trained under this setting are restrictive and not effective for real-world DSR tasks. We make two contributions toward tackling the real-world degradation of different depth sensors. First, we propose to classify the generation of LR depth maps into two types, non-linear downsampling with noise and interval downsampling, and learn DSR models for each accordingly. Second, we propose a new framework for real-world DSR that consists of four modules: 1) an iterative residual learning module with deep supervision that learns effective high-frequency components of depth maps in a coarse-to-fine manner; 2) a channel attention strategy that enhances channels with abundant high-frequency components; 3) a multi-stage fusion module that effectively re-exploits the intermediate results of the coarse-to-fine process; and 4) a depth refinement module that improves the depth map through TGV regularization and an input loss. Extensive experiments on benchmark datasets demonstrate the superiority of our method over current state-of-the-art DSR methods.
Title: Solving Mixed-Modal Jigsaw Puzzle for Fine-Grained Sketch-Based Image Retrieval
Authors: Kaiyue Pang, Yongxin Yang, Timothy M. Hospedales, T. Xiang, Yi-Zhe Song
Venue: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10344-10352
DOI: https://doi.org/10.1109/cvpr42600.2020.01036
Abstract: ImageNet pre-training has long been considered crucial by the fine-grained sketch-based image retrieval (FG-SBIR) community, owing to the lack of large sketch-photo paired datasets for FG-SBIR training. In this paper, we propose a self-supervised alternative for representation pre-training. Specifically, we consider the jigsaw puzzle game of recomposing images from shuffled parts. We identify two key facets of jigsaw task design that are required for effective FG-SBIR pre-training. First, the puzzle must be formulated in a mixed-modality fashion. Second, we show that framing the optimisation as permutation-matrix inference via Sinkhorn iterations is more effective than the common classifier formulation of jigsaw self-supervision. Experiments show that this self-supervised pre-training strategy significantly outperforms the standard ImageNet-based pipeline across all four product-level FG-SBIR benchmarks. Interestingly, it also leads to improved cross-category generalisation across both pre-train/fine-tune and fine-tune/testing stages.
Title: Progressive Adversarial Networks for Fine-Grained Domain Adaptation
Authors: Sinan Wang, Xinyang Chen, Yunbo Wang, Mingsheng Long, Jianmin Wang
Venue: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9210-9219
DOI: https://doi.org/10.1109/cvpr42600.2020.00923
Abstract: Fine-grained visual categorization has long been considered an important problem; however, its real-world application remains restricted, since precisely annotating a large fine-grained image dataset is a laborious task that requires expert-level human knowledge. One solution is to apply domain adaptation to fine-grained scenarios, where the key idea is to discover the commonality between existing fine-grained image datasets and massive unlabeled data in the wild. The main technical bottleneck is that large inter-domain variation deteriorates the subtle boundaries produced by small inter-class variation during domain alignment. This paper presents Progressive Adversarial Networks (PAN), which align fine-grained categories across domains within a curriculum-based adversarial learning framework. In particular, throughout the learning process, domain adaptation is carried out over all multi-grained features, progressively exploiting the label hierarchy from coarse to fine. The progressive learning is applied to both category classification and domain alignment, boosting both the discriminability and the transferability of the fine-grained features. Our method is evaluated on three benchmarks, two of which we propose, and it outperforms state-of-the-art domain adaptation methods.
Title: TESA: Tensor Element Self-Attention via Matricization
Authors: F. Babiloni, Ioannis Marras, G. Slabaugh, S. Zafeiriou
Venue: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 13942-13951
DOI: https://doi.org/10.1109/cvpr42600.2020.01396
Abstract: Representation learning is a fundamental part of modern computer vision, where abstract representations of data are encoded as tensors optimized to solve problems such as image segmentation and inpainting. Recently, self-attention in the form of the Non-Local Block has emerged as a powerful technique for enriching features by capturing complex interdependencies in feature tensors. However, standard self-attention approaches leverage only spatial relationships, drawing similarities between vectors and overlooking correlations between channels. In this paper, we introduce a new method, called Tensor Element Self-Attention (TESA), that generalizes such work to capture interdependencies along all dimensions of the tensor using matricization. An order-R tensor produces R results, one per dimension; the results are then fused to produce an enriched output that encapsulates similarity among tensor elements. Additionally, we analyze self-attention mathematically, providing new perspectives on how it adjusts the singular values of the input feature tensor. With these new insights, we present experimental results demonstrating how TESA can benefit diverse problems, including classification and instance segmentation. By simply adding a TESA module to existing networks, we substantially improve competitive baselines and set new state-of-the-art results for image inpainting on Celeb and low-light RAW-to-RGB image translation on SID.
{"title":"Self-Supervised Domain-Aware Generative Network for Generalized Zero-Shot Learning","authors":"Jiamin Wu, Tianzhu Zhang, Zhengjun Zha, Jiebo Luo, Yongdong Zhang, Feng Wu","doi":"10.1109/CVPR42600.2020.01278","DOIUrl":"https://doi.org/10.1109/CVPR42600.2020.01278","url":null,"abstract":"Generalized Zero-Shot Learning (GZSL) aims at recognizing both seen and unseen classes by constructing correspondence between visual and semantic embedding. However, existing methods have severely suffered from the strong bias problem, where unseen instances in target domain tend to be recognized as seen classes in source domain. To address this issue, we propose an end-to-end Self-supervised Domain-aware Generative Network (SDGN) by integrating self-supervised learning into feature generating model for unbiased GZSL. The proposed SDGN model enjoys several merits. First, we design a cross-domain feature generating module to synthesize samples with high fidelity based on class embeddings, which involves a novel target domain discriminator to preserve the domain consistency. Second, we propose a self-supervised learning module to investigate inter-domain relationships, where a set of anchors are introduced as a bridge between seen and unseen categories. In the shared space, we pull the distribution of target domain away from source domain, and obtain domain-aware features with high discriminative power for both seen and unseen classes. To our best knowledge, this is the first work to introduce self-supervised learning into GZSL as a learning guidance. Extensive experimental results on five standard benchmarks demonstrate that our model performs favorably against state-of-the-art GZSL methods.","PeriodicalId":6715,"journal":{"name":"2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"22 1","pages":"12764-12773"},"PeriodicalIF":0.0,"publicationDate":"2020-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77863495","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: Attention-Guided Hierarchical Structure Aggregation for Image Matting
Authors: Y. Qiao, Yuhao Liu, Xin Yang, D. Zhou, Mingliang Xu, Qiang Zhang, Xiaopeng Wei
Venue: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 13673-13682
DOI: https://doi.org/10.1109/CVPR42600.2020.01369
Abstract: Existing deep-learning-based matting algorithms primarily resort to high-level semantic features to improve the overall structure of alpha mattes. However, we argue that advanced semantics extracted from CNNs contribute unequally to alpha perception and should be reconciled with low-level appearance cues to refine foreground details. In this paper, we propose an end-to-end Hierarchical Attention Matting Network (HAttMatting), which predicts better-structured alpha mattes from single RGB images without additional input. Specifically, we employ spatial and channel-wise attention to integrate appearance cues and pyramidal features in a novel fashion. This blended attention mechanism can perceive alpha mattes from refined boundaries and adaptive semantics. We also introduce a hybrid loss function fusing Structural SIMilarity (SSIM), Mean Squared Error (MSE), and an adversarial loss to guide the network to further improve the overall foreground structure. In addition, we construct a large-scale image matting dataset comprising 59,600 training images and 1,000 test images (646 distinct foreground alpha mattes in total), which further improves the robustness of our hierarchical structure aggregation model. Extensive experiments demonstrate that the proposed HAttMatting can capture sophisticated foreground structure and achieves state-of-the-art performance with single RGB images as input.
{"title":"BFBox: Searching Face-Appropriate Backbone and Feature Pyramid Network for Face Detector","authors":"Yang Liu, Xu Tang","doi":"10.1109/cvpr42600.2020.01358","DOIUrl":"https://doi.org/10.1109/cvpr42600.2020.01358","url":null,"abstract":"Popular backbones designed on image classification have demonstrated their considerable compatibility on the task of general object detection. However, the same phenomenon does not appear on the face detection. This is largely due to the average scale of ground-truth in the WiderFace dataset is far smaller than that of generic objects in theCOCO one. To resolve this, the success of Neural Archi-tecture Search (NAS) inspires us to search face-appropriate backbone and featrue pyramid network (FPN) architecture.Firstly, we design the search space for backbone and FPN by comparing performance of feature maps with different backbones and excellent FPN architectures on the face detection. Second, we propose a FPN-attention module to joint search the architecture of backbone and FPN. Finally,we conduct comprehensive experiments on popular bench-marks, including Wider Face, FDDB, AFW and PASCALFace, display the superiority of our proposed method.","PeriodicalId":6715,"journal":{"name":"2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"26 1","pages":"13565-13574"},"PeriodicalIF":0.0,"publicationDate":"2020-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77073294","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: Advisable Learning for Self-Driving Vehicles by Internalizing Observation-to-Action Rules
Authors: Jinkyu Kim, Suhong Moon, Anna Rohrbach, Trevor Darrell, J. Canny
Venue: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9658-9667
DOI: https://doi.org/10.1109/cvpr42600.2020.00968
Abstract: Humans learn to drive through both practice and theory, e.g. by studying the rules, while most self-driving systems are limited to the former. Being able to incorporate human knowledge of typical causal driving behaviour should benefit autonomous systems. We propose a new approach that learns vehicle control with the help of human advice. Specifically, our system learns to summarize its visual observations in natural language, predict an appropriate action response (e.g. "I see a pedestrian crossing, so I stop"), and predict the controls accordingly. Moreover, to enhance the interpretability of our system, we introduce a fine-grained attention mechanism that relies on semantic segmentation and object-centric RoI pooling. We show that our approach of training the autonomous system with human advice, grounded in a rich semantic representation, matches or outperforms prior work in terms of control prediction and explanation generation. Our approach also yields more interpretable visual explanations via object-centric attention maps. Code is available at https://github.com/JinkyuKimUCB/advisable-driving.
Title: Moving in the Right Direction: A Regularization for Deep Metric Learning
Authors: D. Mohan, Nishant Sankaran, Dennis Fedorishin, S. Setlur, V. Govindaraju
Venue: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 14579-14587
DOI: https://doi.org/10.1109/CVPR42600.2020.01460
Abstract: Deep metric learning leverages carefully designed sampling strategies and loss functions that aid in optimizing the generation of a discriminable embedding space. While effective sampling of pairs is critical for shaping the metric space during training, the relative interactions between pairs, and consequently the forces exerted on these pairs that direct their displacement in the embedding space, can significantly impact the formation of well-separated clusters. In this work, we identify a shortcoming of existing loss formulations: they fail to consider more optimal directions of pair displacement as another criterion for optimization. We propose a novel direction regularization that explicitly accounts for the layout of sampled pairs and attempts to introduce orthogonality in the representations. The proposed regularization is easily integrated into existing loss functions, providing considerable performance improvements. We experimentally validate our hypothesis on the Cars-196, CUB-200, and InShop datasets, outperforming existing methods to yield state-of-the-art results.
{"title":"Boundary-Aware 3D Building Reconstruction From a Single Overhead Image","authors":"Jisan Mahmud, True Price, Akash Bapat, Jan-Michael Frahm","doi":"10.1109/cvpr42600.2020.00052","DOIUrl":"https://doi.org/10.1109/cvpr42600.2020.00052","url":null,"abstract":"We propose a boundary-aware multi-task deep-learning-based framework for fast 3D building modeling from a single overhead image. Unlike most existing techniques which rely on multiple images for 3D scene modeling, we seek to model the buildings in the scene from a single overhead image by jointly learning a modified signed distance function (SDF) from the building boundaries, a dense heightmap of the scene, and scene semantics. To jointly train for these tasks, we leverage pixel-wise semantic segmentation and normalized digital surface maps (nDSM) as supervision, in addition to labeled building outlines. At test time, buildings in the scene are automatically modeled in 3D using only an input overhead image. We demonstrate an increase in building modeling performance using a multi-feature network architecture that improves building outline detection by considering network features learned for the other jointly learned tasks. We also introduce a novel mechanism for robustly refining instance-specific building outlines using the learned modified SDF. We verify the effectiveness of our method on multiple large-scale satellite and aerial imagery datasets, where we obtain state-of-the-art performance in the 3D building reconstruction task.","PeriodicalId":6715,"journal":{"name":"2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"22 1","pages":"438-448"},"PeriodicalIF":0.0,"publicationDate":"2020-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81471331","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}