{"title":"Digging Deeper Into Egocentric Gaze Prediction","authors":"H. R. Tavakoli, Esa Rahtu, Juho Kannala, A. Borji","doi":"10.1109/WACV.2019.00035","DOIUrl":"https://doi.org/10.1109/WACV.2019.00035","url":null,"abstract":"This paper digs deeper into factors that influence egocentric gaze. Instead of training deep models for this purpose in a blind manner, we propose to inspect factors that contribute to gaze guidance during daily tasks. Bottom-up saliency and optical flow are assessed versus strong spatial prior baselines. Task-specific cues such as vanishing point, manipulation point, and hand regions are analyzed as representatives of top-down information. We also look into the contribution of these factors by investigating a simple recurrent neural model for ego-centric gaze prediction. First, deep features are extracted for all input video frames. Then, a gated recurrent unit is employed to integrate information over time and to predict the next fixation. We propose an integrated model that combines the recurrent model with several top-down and bottom-up cues. Extensive experiments over multiple datasets reveal that (1) spatial biases are strong in egocentric videos, (2) bottom-up attention models perform poorly in predicting gaze and underperform spatial biases, (3) deep features perform better compared to traditional features, (4) as opposed to hand regions, the manipulation point is a strong influential cue for gaze prediction, (5) combining the proposed recurrent model with bottom-up cues, vanishing points and, in particular, manipulation point results in the best gaze prediction accuracy over egocentric videos, (6) the knowledge transfer works best for cases where the tasks or sequences are similar, and (7) task and activity recognition can benefit from gaze prediction. Our findings suggest that (1) there should be more emphasis on hand-object interaction and (2) the egocentric vision community should consider larger datasets including diverse stimuli and more subjects.","PeriodicalId":436637,"journal":{"name":"2019 IEEE Winter Conference on Applications of Computer Vision (WACV)","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126496953","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Analyzing Modern Camera Response Functions","authors":"Can Chen, Scott McCloskey, Jingyi Yu","doi":"10.1109/WACV.2019.00213","DOIUrl":"https://doi.org/10.1109/WACV.2019.00213","url":null,"abstract":"Camera Response Functions (CRFs) map the irradiance incident at a sensor pixel to an intensity value in the corresponding image pixel. The nonlinearity of CRFs impact physics-based and low-level computer vision methods like de-blurring, photometric stereo, etc. In addition, CRFs have been used for forensics to identify regions of an image spliced in from a different camera. Despite its importance, the process of radiometrically calibrating a camera's CRF is significantly harder and less standardized than geometric calibration. Competing methods use different mathematical models of the CRF, some of which are derived from an outdated dataset. We present a new dataset of 178 CRFs from modern digital cameras, derived from 1565 camera review images available online, and use it to answer a series of questions about CRFs. Which mathematical models are best for CRF estimation? How have they changed over time? And how unique are CRFs from camera to camera?","PeriodicalId":436637,"journal":{"name":"2019 IEEE Winter Conference on Applications of Computer Vision (WACV)","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133253450","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Task Relation Networks","authors":"Jianshu Li, Pan Zhou, Yunpeng Chen, Jian Zhao, S. Roy, Shuicheng Yan, Jiashi Feng, T. Sim","doi":"10.1109/WACV.2019.00104","DOIUrl":"https://doi.org/10.1109/WACV.2019.00104","url":null,"abstract":"Multi-task learning is popular in machine learning and computer vision. In multitask learning, properly modeling task relations is important for boosting the performance of jointly learned tasks. Task covariance modeling has been successfully used to model the relations of tasks but is limited to homogeneous multi-task learning. In this paper, we propose a feature based task relation modeling approach, suitable for both homogeneous and heterogeneous multi-task learning. First, we propose a new metric to quantify the relations between tasks. Based on the quantitative metric, we then develop the task relation layer, which can be combined with any deep learning architecture to form task relation networks to fully exploit the relations of different tasks in an online fashion. Benefiting from the task relation layer, the task relation networks can better leverage the mutual information from the data. We demonstrate our proposed task relation networks are effective in improving the performance in both homogeneous and heterogeneous multi-task learning settings through extensive experiments on computer vision tasks.","PeriodicalId":436637,"journal":{"name":"2019 IEEE Winter Conference on Applications of Computer Vision (WACV)","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133571948","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Warping-Based Stereoscopic 3D Video Retargeting With Depth Remapping","authors":"Md Baharul Islam, L. Wong, Kok-Lim Low, Chee-Onn Wong","doi":"10.1109/WACV.2019.00181","DOIUrl":"https://doi.org/10.1109/WACV.2019.00181","url":null,"abstract":"Due to the recent availability of different stereoscopic display devices and online 3D media resources (e.g. 3D movies), there is a growing demand for stereoscopic video retargeting that can automatically resize a given stereoscopic video to fit the target display device. In this paper, we propose a warping-based approach that can simultaneously resize and remap the depth of a stereoscopic video to produce a better 3D viewing experience. Firstly, our method computes the significance map for each stereo video frame. It then performs volume warping using non-homogeneous scaling optimization to resize the stereoscopic video. A depth remapping constraint is used to remap the depth and a constraint is applied to preserve the significant content during warping process. Experimental results demonstrate the effectiveness of our method in preserving the significant content, ensuring motion consistency, and enhancing the depth perception of the retargeted video sequences within the comfort depth range.","PeriodicalId":436637,"journal":{"name":"2019 IEEE Winter Conference on Applications of Computer Vision (WACV)","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133891801","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Crowd Counting Using Scale-Aware Attention Networks","authors":"M. Hossain, M. Hosseinzadeh, Omit Chanda, Yang Wang","doi":"10.1109/WACV.2019.00141","DOIUrl":"https://doi.org/10.1109/WACV.2019.00141","url":null,"abstract":"In this paper, we consider the problem of crowd counting in images. Given an image of a crowded scene, our goal is to estimate the density map of this image, where each pixel value in the density map corresponds to the crowd density at the corresponding location in the image. Given the estimated density map, the final crowd count can be obtained by summing over all values in the density map. One challenge of crowd counting is the scale variation in images. In this work, we propose a novel scale-aware attention network to address this challenge. Using the attention mechanism popular in recent deep learning architectures, our model can automatically focus on certain global and local scales appropriate for the image. By combining these global and local scale attentions, our model outperforms other state-of-the-art methods for crowd counting on several benchmark datasets.","PeriodicalId":436637,"journal":{"name":"2019 IEEE Winter Conference on Applications of Computer Vision (WACV)","volume":"78 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133082404","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"3D Reconstruction and Texture Optimization Using a Sparse Set of RGB-D Cameras","authors":"Wei Li, Xiao Xiao, J. Hahn","doi":"10.1109/WACV.2019.00155","DOIUrl":"https://doi.org/10.1109/WACV.2019.00155","url":null,"abstract":"We contribute a new integrated system designed for high-quality 3D reconstructions. The system consists of a sparse set of commodity RGB-D cameras, which allows for fast and accurate scan of objects with multi-view inputs. We propose a robust and efficient tile-based streaming pipeline for geometry reconstruction with TSDF fusion which minimizes memory overhead and calculation cost. Our multi-grid warping method for texture optimization can address misalignments of both global structures and small details due to the errors in multi-camera registration, optical distortions and imprecise geometries. In addition, we apply a global color correction method to reduce color inconsistency among RGB images caused by variations of camera settings. Finally, we demonstrate the effectiveness of our proposed system with detailed experiments of multi-view datasets.","PeriodicalId":436637,"journal":{"name":"2019 IEEE Winter Conference on Applications of Computer Vision (WACV)","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114762814","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Conditional Deep Generative Model of People in Natural Images","authors":"Rodrigo de Bem, Arna Ghosh, A. Boukhayma, Thalaiyasingam Ajanthan, N. Siddharth, Philip H. S. Torr","doi":"10.1109/WACV.2019.00159","DOIUrl":"https://doi.org/10.1109/WACV.2019.00159","url":null,"abstract":"We propose a deep generative model of humans in natural images which keeps 2D pose separated from other latent factors of variation, such as background scene and clothing. In contrast to methods that learn generative models of low-dimensional representations, e.g., segmentation masks and 2D skeletons, our single-stage end-to-end conditional-VAEGAN learns directly on the image space. The flexibility of this approach allows the sampling of people with independent variations of pose and appearance. Moreover, it enables the reconstruction of images conditioned to a given posture, allowing, for instance, pose-transfer from one person to another. We validate our method on the Human3.6M dataset and achieve state-of-the-art results on the ChictopiaPlus benchmark. Our model, named Conditional-DGPose, outperforms the closest related work in the literature. It generates more realistic and accurate images regarding both, body posture and image quality, learning the underlying factors of pose and appearance variation.","PeriodicalId":436637,"journal":{"name":"2019 IEEE Winter Conference on Applications of Computer Vision (WACV)","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116066584","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improving Diversity of Image Captioning Through Variational Autoencoders and Adversarial Learning","authors":"Li Ren, Guo-Jun Qi, K. Hua","doi":"10.1109/WACV.2019.00034","DOIUrl":"https://doi.org/10.1109/WACV.2019.00034","url":null,"abstract":"Learning translation from images to human-readable natural language has become a great challenge in computer vision research in recent years. Existing works explore the semantic correlation between the visual and language domains via encoder-to-decoder learning frameworks based on classifying visual features in the language domain. This approach, however, is criticized for its lacking of naturalness and diversity. In this paper, we demonstrate a novel way to learn a semantic connection between visual information and natural language directly based on a Variational Autoencoder (VAE) that is trained in an adversarial routine. Instead of using the classification based discriminator, our method directly learns to estimate the diversity between a hidden vector embedded from a text encoder and an informative feature that is sampled from a learned distribution of the autoencoders. We show that the sentences learned from this matching contains accurate semantic meaning with high diversity in the image captioning task. Our experiments on the popular MSCOCO dataset indicates that our method learns to generate high-quality natural language with competitive scores on both correctness and diversity.","PeriodicalId":436637,"journal":{"name":"2019 IEEE Winter Conference on Applications of Computer Vision (WACV)","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116169966","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Skip Residual Pairwise Networks With Learnable Comparative Functions for Few-Shot Learning","authors":"A. Mehrotra, Ambedkar Dukkipati","doi":"10.1109/WACV.2019.00099","DOIUrl":"https://doi.org/10.1109/WACV.2019.00099","url":null,"abstract":"In this work we consider the ubiquitous Siamese network architecture and hypothesize that having an end-to-end learnable comparative function instead of an arbitrarily fixed one used commonly in practice (such as dot product) would allow the network to learn a final representation more suited to the task at hand and generalize better with very small quantities of data. Based on this we propose Skip Residual Pairwise Networks (SRPN) for few-shot learning based on residual Siamese networks. We validate our hypothesis by evaluating the proposed model for few-shot learning on Omniglot and mini-Imagenet datasets. Our model outperforms the residual Siamese design of equal depth and parameters. We also show that our model is competitive with state-of-the-art meta-learning based methods for few-shot learning on the challenging mini-Imagenet dataset whilst being a much simpler design, obtaining 54.4% accuracy on the five-way few-shot learning task with only a single example per class and over 70% accuracy with five examples per class. We further observe that the network weights in our model are much smaller compared to an equivalent residual Siamese Network under similar regularization, thus validating our hypothesis that our model design allows for better generalization. We also observe that our asymmetric, non-metric SRPN design automatically learns to approximate natural metric learning priors such as a symmetry and the triangle inequality.","PeriodicalId":436637,"journal":{"name":"2019 IEEE Winter Conference on Applications of Computer Vision (WACV)","volume":"61 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121469679","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Cross Domain Residual Transfer Learning for Person Re-Identification","authors":"Furqan Khan, F. Brémond","doi":"10.1109/WACV.2019.00219","DOIUrl":"https://doi.org/10.1109/WACV.2019.00219","url":null,"abstract":"This paper presents a novel way to transfer model weights from one domain to another using residual learning framework instead of direct fine-tuning. It also argues for hybrid models that use learned (deep) features and statistical metric learning for multi-shot person re-identification when training sets are small. This is in contrast to popular end-to-end neural network based models or models that use hand-crafted features with adaptive matching models (neural nets or statistical metrics). Our experiments demonstrate that a hybrid model with residual transfer learning can yield significantly better re-identification performance than an end-to-end model when training set is small. On iLIDS-VID and PRID datasets, we achieve rank-1 recognition rates of 89.8% and 95%, respectively, which is a significant improvement over state-of-the-art.","PeriodicalId":436637,"journal":{"name":"2019 IEEE Winter Conference on Applications of Computer Vision (WACV)","volume":"102 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122687980","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}