Modeling Facial Geometry Using Compositional VAEs
Timur M. Bagautdinov, Chenglei Wu, Jason M. Saragih, P. Fua, Yaser Sheikh
2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3877-3886, June 2018. DOI: 10.1109/CVPR.2018.00408
Abstract: We propose a method for learning non-linear face geometry representations using deep generative models. Our model is a variational autoencoder with multiple levels of hidden variables, where lower layers capture global geometry and higher ones encode more local deformations. Building on this, we propose a new parameterization of facial geometry that naturally decomposes the structure of the human face into semantically meaningful levels of detail. This parameterization enables model fitting that captures varying levels of detail under different types of geometric constraints.
Deep Video Super-Resolution Network Using Dynamic Upsampling Filters Without Explicit Motion Compensation
Younghyun Jo, Seoung Wug Oh, Jaeyeon Kang, Seon Joo Kim
2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3224-3232, June 2018. DOI: 10.1109/CVPR.2018.00340
Abstract: Video super-resolution (VSR) has become increasingly important for providing high-resolution (HR) content for ultra-high-definition displays. While many deep-learning-based VSR methods have been proposed, most rely heavily on the accuracy of motion estimation and compensation. In this paper, we introduce a fundamentally different framework for VSR. We propose a novel end-to-end deep neural network that generates dynamic upsampling filters and a residual image, both computed from the local spatio-temporal neighborhood of each pixel, thereby avoiding explicit motion compensation. With our approach, an HR image is reconstructed directly from the input image using the dynamic upsampling filters, and fine details are added through the computed residual. Aided by a new data augmentation technique, our network generates much sharper HR videos with better temporal consistency than previous methods. We also provide an analysis of our network through extensive experiments to show how it handles motion implicitly.
{"title":"Occlusion-Aware Rolling Shutter Rectification of 3D Scenes","authors":"Subeesh Vasu, R. MaheshMohanM., A. Rajagopalan","doi":"10.1109/CVPR.2018.00073","DOIUrl":"https://doi.org/10.1109/CVPR.2018.00073","url":null,"abstract":"A vast majority of contemporary cameras employ rolling shutter (RS) mechanism to capture images. Due to the sequential mechanism, images acquired with a moving camera are subjected to rolling shutter effect which manifests as geometric distortions. In this work, we consider the specific scenario of a fast moving camera wherein the rolling shutter distortions not only are predominant but also become depth-dependent which in turn results in intra-frame occlusions. To this end, we develop a first-of-its-kind pipeline to recover the latent image of a 3D scene from a set of such RS distorted images. The proposed approach sequentially recovers both the camera motion and scene structure while accounting for RS and occlusion effects. Subsequently, we perform depth and occlusion-aware rectification of RS images to yield the desired latent image. Our experiments on synthetic and real image sequences reveal that the proposed approach achieves state-of-the-art results.","PeriodicalId":6564,"journal":{"name":"2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition","volume":"67 1","pages":"636-645"},"PeriodicalIF":0.0,"publicationDate":"2018-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90374398","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
On the Importance of Label Quality for Semantic Segmentation
A. Zlateski, Ronnachai Jaroensri, Prafull Sharma, F. Durand
2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1479-1487, June 2018. DOI: 10.1109/CVPR.2018.00160
Abstract: Convolutional networks (ConvNets) have become the dominant approach to semantic image segmentation. Producing the accurate, pixel-level labels required for this task is a tedious and time-consuming process; producing approximate, coarse labels, however, takes only a fraction of the time and effort. We investigate the relationship between label quality and ConvNet performance for semantic segmentation. We create a very large synthetic dataset of perfectly labeled street-view scenes. From these perfect labels, we synthetically coarsen labels to different quality levels and estimate the human-hours required to produce them. We then train ConvNets with varying numbers of training images and label qualities. We find that ConvNet performance mostly depends on the time spent creating the training labels: a larger, coarsely annotated dataset can yield the same performance as a smaller, finely annotated one. Furthermore, fine-tuning a coarsely pre-trained ConvNet with a few finely annotated labels can match or exceed training with a large amount of finely annotated labels alone, at a fraction of the labeling cost. We show that this result holds across network architectures and across object classes in an urban scene.
{"title":"Egocentric Activity Recognition on a Budget","authors":"Rafael Possas, Sheila M. Pinto-Caceres, F. Ramos","doi":"10.1109/CVPR.2018.00625","DOIUrl":"https://doi.org/10.1109/CVPR.2018.00625","url":null,"abstract":"Recent advances in embedded technology have enabled more pervasive machine learning. One of the common applications in this field is Egocentric Activity Recognition (EAR), where users wearing a device such as a smartphone or smartglasses are able to receive feedback from the embedded device. Recent research on activity recognition has mainly focused on improving accuracy by using resource intensive techniques such as multi-stream deep networks. Although this approach has provided state-of-the-art results, in most cases it neglects the natural resource constraints (e.g. battery) of wearable devices. We develop a Reinforcement Learning model-free method to learn energy-aware policies that maximize the use of low-energy cost predictors while keeping competitive accuracy levels. Our results show that a policy trained on an egocentric dataset is able use the synergy between motion and vision sensors to effectively tradeoff energy expenditure and accuracy on smartglasses operating in realistic, real-world conditions.","PeriodicalId":6564,"journal":{"name":"2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition","volume":"20 1","pages":"5967-5976"},"PeriodicalIF":0.0,"publicationDate":"2018-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73480781","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Camera Pose Estimation with Unknown Principal Point","authors":"Viktor Larsson, Z. Kukelova, Yinqiang Zheng","doi":"10.1109/CVPR.2018.00315","DOIUrl":"https://doi.org/10.1109/CVPR.2018.00315","url":null,"abstract":"To estimate the 6-DoF extrinsic pose of a pinhole camera with partially unknown intrinsic parameters is a critical sub-problem in structure-from-motion and camera localization. In most of existing camera pose estimation solvers, the principal point is assumed to be in the image center. Unfortunately, this assumption is not always true, especially for asymmetrically cropped images. In this paper, we develop the first exactly minimal solver for the case of unknown principal point and focal length by using four and a half point correspondences (P4.5Pfuv). We also present an extremely fast solver for the case of unknown aspect ratio (P5Pfuva). The new solvers outperform the previous state-of-the-art in terms of stability and speed. Finally, we explore the extremely challenging case of both unknown principal point and radial distortion, and develop the first practical non-minimal solver by using seven point correspondences (P7Pfruv). Experimental results on both simulated data and real Internet images demonstrate the usefulness of our new solvers.","PeriodicalId":6564,"journal":{"name":"2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition","volume":"26 1","pages":"2984-2992"},"PeriodicalIF":0.0,"publicationDate":"2018-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74223184","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Towards Pose Invariant Face Recognition in the Wild
Jian Zhao, Yu Cheng, Yan Xu, Lin Xiong, Jianshu Li, F. Zhao, J. Karlekar, Sugiri Pranata, Shengmei Shen, Junliang Xing, Shuicheng Yan, Jiashi Feng
2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2207-2216, June 2018. DOI: 10.1109/CVPR.2018.00235
Abstract: Pose variation is a key challenge in face recognition. In contrast to current techniques for pose-invariant face recognition, which either directly extract pose-invariant features for recognition or first normalize profile face images to frontal pose before feature extraction, we argue that it is more desirable to perform both tasks jointly so that they can benefit from each other. To this end, we propose a Pose Invariant Model (PIM) for face recognition in the wild, with three distinct novelties. First, PIM is a novel and unified deep architecture, containing a Face Frontalization sub-Net (FFN) and a Discriminative Learning sub-Net (DLN), which are jointly learned end to end. Second, the FFN is a carefully designed dual-path Generative Adversarial Network (GAN) that simultaneously perceives global structures and local details, combined with unsupervised cross-domain adversarial training and a "learning to learn" strategy for high-fidelity and identity-preserving frontal-view synthesis. Third, the DLN is a generic Convolutional Neural Network (CNN) for face recognition with our enforced cross-entropy optimization strategy for learning discriminative yet generalized feature representations. Qualitative and quantitative experiments on both controlled and in-the-wild benchmarks demonstrate the superiority of the proposed model over state-of-the-art methods.
{"title":"Statistical Tomography of Microscopic Life","authors":"Aviad Levis, Y. Schechner, R. Talmon","doi":"10.1109/CVPR.2018.00671","DOIUrl":"https://doi.org/10.1109/CVPR.2018.00671","url":null,"abstract":"We achieve tomography of 3D volumetric natural objects, where each projected 2D image corresponds to a different specimen. Each specimen has unknown random 3D orientation, location, and scale. This imaging scenario is relevant to microscopic and mesoscopic organisms, aerosols and hydrosols viewed naturally by a microscope. In-class scale variation inhibits prior single-particle reconstruction methods. We thus generalize tomographic recovery to account for all degrees of freedom of a similarity transformation. This enables geometric self-calibration in imaging of transparent objects. We make the computational load manageable and reach good quality reconstruction in a short time. This enables extraction of statistics that are important for a scientific study of specimen populations, specifically size distribution parameters. We apply the method to study of plankton.","PeriodicalId":6564,"journal":{"name":"2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition","volume":"86 1","pages":"6411-6420"},"PeriodicalIF":0.0,"publicationDate":"2018-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74607350","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
LAMV: Learning to Align and Match Videos with Kernelized Temporal Layers
L. Baraldi, Matthijs Douze, R. Cucchiara, H. Jégou
2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7804-7813, June 2018. DOI: 10.1109/CVPR.2018.00814
Abstract: This paper presents a learnable approach for comparing and aligning videos. Our architecture builds upon and revisits temporal match kernels within neural networks: we propose a new temporal layer that finds temporal alignments by maximizing the scores between two sequences of vectors, according to a time-sensitive similarity metric parametrized in the Fourier domain. We learn this layer with a temporal proposal strategy, minimizing a triplet loss that accounts for both localization accuracy and recognition rate. We evaluate our approach on video alignment, copy detection, and event retrieval. Our approach outperforms the state of the art on temporal video alignment and video copy detection datasets in comparable setups. It also attains the best reported results for particular-event search, while precisely aligning videos.
Re-weighted Adversarial Adaptation Network for Unsupervised Domain Adaptation
Qingchao Chen, Yang Liu, Zhaowen Wang, I. Wassell, K. Chetty
2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7976-7985, June 2018. DOI: 10.1109/CVPR.2018.00832
Abstract: Unsupervised Domain Adaptation (UDA) aims to transfer domain knowledge from existing, well-defined tasks to new ones where labels are unavailable. In real-world applications, as domain (task) discrepancies are usually uncontrollable, there is strong motivation to match the feature distributions even when the domain discrepancies are disparate. Additionally, since no labels are available in the target domain, how to adapt the classifier from the source to the target domain remains an open question. In this paper, we propose the Re-weighted Adversarial Adaptation Network (RAAN) to reduce the feature distribution divergence and adapt the classifier when domain discrepancies are disparate. Specifically, to alleviate the need for common support when matching the feature distributions, we minimize the optimal transport (OT) based Earth Mover's (EM) distance and reformulate it as a minimax objective function. This allows RAAN to be trained end-to-end in an adversarial manner. To further adapt the classifier, we propose to match the label distributions and embed this into the adversarial training. Finally, after extensive evaluation on UDA datasets of varying difficulty, RAAN achieves state-of-the-art results and outperforms other methods by a large margin when the domain shifts are disparate.