{"title":"Multi-frame Recurrent Adversarial Network for Moving Object Segmentation","authors":"Prashant W. Patil, Akshay Dudhane, S. Murala","doi":"10.1109/WACV48630.2021.00235","DOIUrl":"https://doi.org/10.1109/WACV48630.2021.00235","url":null,"abstract":"Moving object segmentation (MOS) in different practical scenarios like weather degraded, dynamic background, etc. videos is a challenging and high demanding task for various computer vision applications. Existing supervised approaches achieve remarkable performance with complicated training or extensive fine-tuning or inappropriate training-testing data distribution. Also, the generalized effect of existing works with completely unseen data is difficult to identify. In this work, the recurrent feature sharing based generative adversarial network is proposed with unseen video analysis. The proposed network comprises of dilated convolution to extract the spatial features at multiple scales. Along with the temporally sampled multiple frames, previous frame output is considered as input to the network. As the motion is very minute between the two consecutive frames, the previous frame decoder features are shared with encoder features recurrently for current frame foreground segmentation. This recurrent feature sharing of different layers helps the encoder network to learn the hierarchical interactions between the motion and appearance-based features. Also, the learning of the proposed network is concentrated in different ways, like disjoint and global training-testing for MOS. An extensive experimental analysis of the proposed network is carried out on two benchmark video datasets with seen and unseen MOS video. Qualitative and quantitative experimental study shows that the proposed network outperforms the existing methods.","PeriodicalId":236300,"journal":{"name":"2021 IEEE Winter Conference on Applications of Computer Vision (WACV)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127865289","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"EVET: Enhancing Visual Explanations of Deep Neural Networks Using Image Transformations","authors":"Youngrock Oh, Hyungsik Jung, Jeonghyung Park, Min Soo Kim","doi":"10.1109/WACV48630.2021.00362","DOIUrl":"https://doi.org/10.1109/WACV48630.2021.00362","url":null,"abstract":"Numerous interpretability methods have been developed to visually explain the behavior of complex machine learning models by estimating parts of the input image that are critical for the model’s prediction. We propose a general pipeline of enhancing visual explanations using image transformations (EVET). EVET considers transformations of the original input image to refine the critical input region based on an intuitive rationale that the region estimated to be important in variously transformed inputs is more important. Our proposed EVET is applicable to existing visual explanation methods without modification. We validate the effectiveness of the proposed method qualitatively and quantitatively to show that the resulting explanation method outperforms the original in terms of faithfulness, localization, and stability. We also demonstrate that EVET can be used to achieve desirable performance with a low computational cost. For example, EVET-applied Grad-CAM achieves performance comparable to Score-CAM, which is the state-of-the-art activation-based explanation method, while reducing execution time by more than 90% on VOC, COCO, and ImageNet.","PeriodicalId":236300,"journal":{"name":"2021 IEEE Winter Conference on Applications of Computer Vision (WACV)","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125377599","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Video Captioning of Future Frames","authors":"M. Hosseinzadeh, Yang Wang","doi":"10.1109/WACV48630.2021.00102","DOIUrl":"https://doi.org/10.1109/WACV48630.2021.00102","url":null,"abstract":"Being able to anticipate and describe what may happen in the future is a fundamental ability for humans. Given a short clip of a scene about \"a person is sitting behind a piano\", humans can describe what will happen afterward, i.e. \"the person is playing the piano\". In this paper, we consider the task of captioning future events to assess the performance of intelligent models on anticipation and video description generation tasks simultaneously. More specifically, given only the frames relating to an occurring event (activity), the goal is to generate a sentence describing the most likely next event in the video. We tackle the problem by first predicting the next event in the semantic space of convolutional features, then fusing contextual information into those features, and feeding them to a captioning module. Departing from using recurrent units allows us to train the network in parallel. We compare the proposed method with a baseline and an oracle method on the ActivityNet-Captions dataset. Experimental results demonstrate that the proposed method outperforms the baseline and is comparable to the oracle method. We perform additional ablation study to further analyze our approach.","PeriodicalId":236300,"journal":{"name":"2021 IEEE Winter Conference on Applications of Computer Vision (WACV)","volume":"289 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121330511","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"End-to-end Learning Improves Static Object Geo-localization from Video","authors":"Mohamed Chaabane, L. Gueguen, A. Trabelsi, J. Beveridge, Stephen O'Hara","doi":"10.1109/WACV48630.2021.00211","DOIUrl":"https://doi.org/10.1109/WACV48630.2021.00211","url":null,"abstract":"Accurately estimating the position of static objects, such as traffic lights, from the moving camera of a self-driving car is a challenging problem. In this work, we present a system that improves the localization of static objects by jointly-optimizing the components of the system via learning. Our system is comprised of networks that perform: 1) 5DoF object pose estimation from a single image, 2) association of objects between pairs of frames, and 3) multi-object tracking to produce the final geo-localization of the static objects within the scene. We evaluate our approach using a publicly-available data set, focusing on traffic lights due to data availability. For each component, we compare against contemporary alternatives and show significantly-improved performance. We also show that the end-to-end system performance is further improved via joint-training of the constituent models. Code is available at: https://github.com/MedChaabane/Static_Objects_Geolocalization.","PeriodicalId":236300,"journal":{"name":"2021 IEEE Winter Conference on Applications of Computer Vision (WACV)","volume":"112 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115265840","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Let’s Get Dirty: GAN Based Data Augmentation for Camera Lens Soiling Detection in Autonomous Driving","authors":"Michal Uřičář, Ganesh Sistu, Hazem Rashed, Antonín Vobecký, V. Kumar, P. Krízek, Fabian Burger, S. Yogamani","doi":"10.1109/WACV48630.2021.00081","DOIUrl":"https://doi.org/10.1109/WACV48630.2021.00081","url":null,"abstract":"Wide-angle fisheye cameras are commonly used in automated driving for parking and low-speed navigation tasks. Four of such cameras form a surround-view system that provides a complete and detailed view of the vehicle. These cameras are directly exposed to harsh environmental settings and can get soiled very easily by mud, dust, water, frost. Soiling on the camera lens can severely degrade the visual perception algorithms, and a camera cleaning system triggered by a soiling detection algorithm is increasingly being deployed. While adverse weather conditions, such as rain, are getting attention recently, there is only limited work on general soiling. The main reason is the difficulty in collecting a diverse dataset as it is a relatively rare event.We propose a novel GAN based algorithm for generating unseen patterns of soiled images. Additionally, the proposed method automatically provides the corresponding soiling masks eliminating the manual annotation cost. Augmentation of the generated soiled images for training improves the accuracy of soiling detection tasks significantly by 18% demonstrating its usefulness. The manually annotated soiling dataset and the generated augmentation dataset will be made public. We demonstrate the generalization of our fisheye trained GAN model on the Cityscapes dataset. We provide an empirical evaluation of the degradation of the semantic segmentation algorithm with the soiled data.","PeriodicalId":236300,"journal":{"name":"2021 IEEE Winter Conference on Applications of Computer Vision (WACV)","volume":"81 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115171293","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Benefiting from Bicubically Down-Sampled Images for Learning Real-World Image Super-Resolution","authors":"Mohammad Saeed Rad, Thomas Yu, C. Musat, H. K. Ekenel, B. Bozorgtabar, J. Thiran","doi":"10.1109/WACV48630.2021.00163","DOIUrl":"https://doi.org/10.1109/WACV48630.2021.00163","url":null,"abstract":"Super-resolution (SR) has traditionally been based on pairs of high-resolution images (HR) and their low-resolution (LR) counterparts obtained artificially with bicubic downsampling. However, in real-world SR, there is a large variety of realistic image degradations and analytically modeling these realistic degradations can prove quite difficult. In this work, we propose to handle real-world SR by splitting this ill-posed problem into two comparatively more well-posed steps. First, we train a network to transform real LR images to the space of bicubically down-sampled images in a supervised manner, by using both real LR/HR pairs and synthetic pairs. Second, we take a generic SR network trained on bicubically downsampled images to super-resolve the transformed LR image. The first step of the pipeline addresses the problem by registering the large variety of degraded images to a common, well understood space of images. The second step then leverages the already impressive performance of SR on bicubically downsampled images, sidestepping the issues of end-to-end training on datasets with many different image degradations. We demonstrate the effectiveness of our proposed method by comparing it to recent methods in real-world SR and show that our proposed approach outperforms the state-of-the-art works in terms of both qualitative and quantitative results, as well as results of an extensive user study conducted on several real image datasets.","PeriodicalId":236300,"journal":{"name":"2021 IEEE Winter Conference on Applications of Computer Vision (WACV)","volume":"132 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133343524","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Line Art Correlation Matching Feature Transfer Network for Automatic Animation Colorization","authors":"Qian Zhang, Bo Wang, W. Wen, Hai Li, Junhui Liu","doi":"10.1109/WACV48630.2021.00392","DOIUrl":"https://doi.org/10.1109/WACV48630.2021.00392","url":null,"abstract":"Automatic animation line art colorization is a challenging computer vision problem, since the information of the line art is highly sparse and abstracted and there exists a strict requirement for the color and style consistency between frames. Recently, a lot of Generative Adversarial Network (GAN) based image-to-image translation methods for single line art colorization have emerged. They can generate perceptually appealing results conditioned on line art images. However, these methods can not be adopted for the purpose of animation colorization because there is a lack of consideration of the in-between frame consistency. Existing methods simply input the previous colored frame as a reference to color the next line art, which will mislead the colorization due to the spatial misalignment of the previous colored frame and the next line art especially at positions where apparent changes happen. To address these challenges, we design a kind of correlation matching feature transfer model (called CMFT) to align the colored reference feature in a learnable way and integrate the model into an U-Net based generator in a coarse-to-fine manner This enables the generator to transfer the layer-wise synchronized features from the deep semantic code to the content progressively. Extension evaluation shows that CMFT model can effectively improve the in-between consistency and the quality of colored frames especially when the motion is intense and diverse.","PeriodicalId":236300,"journal":{"name":"2021 IEEE Winter Conference on Applications of Computer Vision (WACV)","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134017752","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Multi-Class Hinge Loss for Conditional GANs","authors":"Ilya Kavalerov, W. Czaja","doi":"10.1109/WACV48630.2021.00133","DOIUrl":"https://doi.org/10.1109/WACV48630.2021.00133","url":null,"abstract":"We propose a new algorithm to incorporate class conditional information into the critic of GANs via a multi-class generalization of the commonly used Hinge loss that is compatible with both supervised and semi-supervised settings. We study the compromise between training a state of the art generator and an accurate classifier simultaneously, and propose a way to use our algorithm to measure the degree to which a generator and critic are class conditional. We show the trade-off between a generator-critic pair respecting class conditioning inputs and generating the highest quality images. With our multi-hinge loss modification we are able to improve Inception Scores and Frechet Inception Distance on the Imagenet dataset.","PeriodicalId":236300,"journal":{"name":"2021 IEEE Winter Conference on Applications of Computer Vision (WACV)","volume":"4 3","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114133546","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Large-Scale, Time-Synchronized Visible and Thermal Face Dataset","authors":"Domenick Poster, Matthew D. Thielke, R. Nguyen, Srinivasan Rajaraman, Xing Di, Cedric Nimpa Fondje, Vishal M. Patel, Nathan J. Short, B. Riggan, N. Nasrabadi, Shuowen Hu","doi":"10.1109/WACV48630.2021.00160","DOIUrl":"https://doi.org/10.1109/WACV48630.2021.00160","url":null,"abstract":"Thermal face imagery, which captures the naturally emitted heat from the face, is limited in availability compared to face imagery in the visible spectrum. To help address this scarcity of thermal face imagery for research and algorithm development, we present the DEVCOM Army Research Laboratory Visible-Thermal Face Dataset (ARL-VTF). With over 500,000 images from 395 subjects, the ARL-VTF dataset represents, to the best of our knowledge, the largest collection of paired visible and thermal face images to date. The data was captured using a modern long wave infrared (LWIR) camera mounted alongside a stereo setup of three visible spectrum cameras. Variability in expressions, pose, and eyewear has been systematically recorded. The dataset has been curated with extensive annotations, metadata, and standardized protocols for evaluation. Furthermore, this paper presents extensive benchmark results and analysis on thermal face landmark detection and thermal-to-visible face verification by evaluating state-of-the-art models on the ARL-VTF dataset.","PeriodicalId":236300,"journal":{"name":"2021 IEEE Winter Conference on Applications of Computer Vision (WACV)","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114545265","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DualSANet: Dual Spatial Attention Network for Iris Recognition","authors":"Kai Yang, Zihao Xu, Jingjing Fei","doi":"10.1109/WACV48630.2021.00093","DOIUrl":"https://doi.org/10.1109/WACV48630.2021.00093","url":null,"abstract":"Compared with other human biosignatures, iris has more advantages on accuracy, invariability and robustness. However, the performance of existing common iris recognition algorithms is still far from expectations of the community. Although some researchers have attempted to uti-lize deep learning methods which are superior to traditional methods, it is worth exploring better CNN network architecture. In this paper, we propose a novel network architecture based on the dual spatial attention mechanism for iris recognition, called DualSANet. Specifically, the proposed architecture can generate multi-level spatially corresponding feature representations via an encoder-decoder structure. In the meantime, we also propose a new spatial attention feature fusion module, so as to ensemble these features more effectively. Based on these, our architecture can generate dual feature representations which have complementary discriminative information. Extensive experiments are conducted on CASIA-IrisV4-Thousand, CASIA-IrisV4-Distance, and IITD datasets. The experimental results show that our method achieves superior performance compared with the state-of-the-arts.","PeriodicalId":236300,"journal":{"name":"2021 IEEE Winter Conference on Applications of Computer Vision (WACV)","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115765964","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}