2021 IEEE Winter Conference on Applications of Computer Vision (WACV): Latest Publications

Have Fun Storming the Castle(s)!
2021 IEEE Winter Conference on Applications of Computer Vision (WACV) Pub Date : 2021-01-01 DOI: 10.1109/WACV48630.2021.00375
Connor Anderson, Adam Teuscher, Elizabeth Anderson, Alysia Larsen, Josh Shirley, Ryan Farrell
{"title":"Have Fun Storming the Castle(s)!","authors":"Connor Anderson, Adam Teuscher, Elizabeth Anderson, Alysia Larsen, Josh Shirley, Ryan Farrell","doi":"10.1109/WACV48630.2021.00375","DOIUrl":"https://doi.org/10.1109/WACV48630.2021.00375","url":null,"abstract":"In recent years, large-scale datasets, each typically tailored to a particular problem, have become a critical factor towards fueling rapid progress in the field of computer vision. This paper describes a valuable new dataset that should accelerate research efforts on problems such as fine-grained classification, instance recognition and retrieval, and geolocalization. The dataset, comprised of more than 2400 individual castles, palaces and fortresses from more than 90 countries, contains more than 770K images in total. This paper details the dataset’s construction process, the characteristics including annotations such as location (geotagged latlong and country label), construction date, Google Maps link and estimated per-class and per-image difficulty. An experimental section provides baseline experiments for important vision tasks including classification, instance retrieval and geolocalization (estimating global location from an image’s visual appearance). The dataset is publicly available at vision.cs.byu.edu/castles.","PeriodicalId":236300,"journal":{"name":"2021 IEEE Winter Conference on Applications of Computer Vision (WACV)","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115746228","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 3
How to Make a BLT Sandwich? Learning VQA towards Understanding Web Instructional Videos
2021 IEEE Winter Conference on Applications of Computer Vision (WACV) Pub Date : 2021-01-01 DOI: 10.1109/WACV48630.2021.00117
Shaojie Wang, Wentian Zhao, Ziyi Kou, Jing Shi, Chenliang Xu
{"title":"How to Make a BLT Sandwich? Learning VQA towards Understanding Web Instructional Videos","authors":"Shaojie Wang, Wentian Zhao, Ziyi Kou, Jing Shi, Chenliang Xu","doi":"10.1109/WACV48630.2021.00117","DOIUrl":"https://doi.org/10.1109/WACV48630.2021.00117","url":null,"abstract":"Understanding web instructional videos is an essential branch of video understanding in two aspects. First, most existing video methods focus on short-term actions for a-few-second-long video clips; these methods are not directly applicable to long videos. Second, unlike unconstrained long videos, e.g., movies, instructional videos are more structured in that they have step-by-step procedures constraining the understanding task. In this work, we study problem-solving on instructional videos via Visual Question Answering (VQA). Surprisingly, it has not been an emphasis for the video community despite its rich applications. We thereby introduce YouCookQA, an annotated QA dataset for instructional videos based on YouCook2 [27]. The questions in YouCookQA are not limited to cues on a single frame but relations among multiple frames in the temporal dimension. Observing the lack of effective representations for modeling long videos, we propose a set of carefully designed models including a Recurrent Graph Convolutional Network (RGCN) that captures both temporal order and relational information. Furthermore, we study multiple modalities including descriptions and transcripts for the purpose of boosting video understanding. Extensive experiments on YouCookQA suggest that RGCN performs the best in terms of QA accuracy and better performance is gained by introducing human-annotated descriptions. YouCookQA dataset is available at https://github.com/Jossome/YoucookQA.","PeriodicalId":236300,"journal":{"name":"2021 IEEE Winter Conference on Applications of Computer Vision (WACV)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127433719","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
A Deflation based Fast and Robust Preconditioner for Bundle Adjustment
2021 IEEE Winter Conference on Applications of Computer Vision (WACV) Pub Date : 2021-01-01 DOI: 10.1109/WACV48630.2021.00182
Shrutimoy Das, Siddhant Katyan, Pawan Kumar
{"title":"A Deflation based Fast and Robust Preconditioner for Bundle Adjustment","authors":"Shrutimoy Das, Siddhant Katyan, Pawan Kumar","doi":"10.1109/WACV48630.2021.00182","DOIUrl":"https://doi.org/10.1109/WACV48630.2021.00182","url":null,"abstract":"The bundle adjustment (BA) problem is formulated as a non linear least squares problem which, requires the solution of a linear system. For solving this system, we present the design and implementation of a fast preconditioned solver. The proposed preconditioner is based on the deflation of the largest eigenvalues of the Hessian. We also derive an estimate of the condition number of the preconditioned system. Numerical experiments on problems from the BAL dataset [3] suggest that our solver is the fastest, sometimes, by a factor of five, when compared to the current state-of-the-art solvers for bundle adjustment.","PeriodicalId":236300,"journal":{"name":"2021 IEEE Winter Conference on Applications of Computer Vision (WACV)","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122296964","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 8
Are These from the Same Place? Seeing the Unseen in Cross-View Image Geo-Localization
2021 IEEE Winter Conference on Applications of Computer Vision (WACV) Pub Date : 2021-01-01 DOI: 10.1109/WACV48630.2021.00380
Royston Rodrigues, Masahiro Tani
{"title":"Are These from the Same Place? Seeing the Unseen in Cross-View Image Geo-Localization","authors":"Royston Rodrigues, Masahiro Tani","doi":"10.1109/WACV48630.2021.00380","DOIUrl":"https://doi.org/10.1109/WACV48630.2021.00380","url":null,"abstract":"In an era where digital maps act as gateways to exploring the world, the availability of large scale geo-tagged imagery has inspired a number of visual navigation techniques. One promising approach to visual navigation is cross-view image geo-localization. Here, the images whose location needs to be determined are matched against a database of geo-tagged aerial imagery. The methods based on this approach sought to resolve view point changes. But scenes also vary temporally, during which new landmarks might appear or existing ones might disappear. One cannot guarantee storage of aerial imagery across all time instants and hence a technique robust to temporal variation in scenes becomes of paramount importance. In this paper, we address the temporal gap between scenes by proposing a two step approach. First, we propose a semantically driven data augmentation technique that gives Siamese networks the ability to hallucinate unseen objects. Then we present the augmented samples to a multi-scale attentive embedding network to perform matching tasks. Experiments on standard benchmarks demonstrate the integration of the proposed approach with existing frameworks improves top-1 image recall rate on the CVUSA data-set from 89.84 % to 93.09 %, and from 81.03 % to 87.21 % on the CVACT data-set.","PeriodicalId":236300,"journal":{"name":"2021 IEEE Winter Conference on Applications of Computer Vision (WACV)","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114394410","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 26
Coarse Temporal Attention Network (CTA-Net) for Driver’s Activity Recognition
2021 IEEE Winter Conference on Applications of Computer Vision (WACV) Pub Date : 2021-01-01 DOI: 10.1109/WACV48630.2021.00132
Zachary Wharton, Ardhendu Behera, Yonghuai Liu, Nikolaos Bessis
{"title":"Coarse Temporal Attention Network (CTA-Net) for Driver’s Activity Recognition","authors":"Zachary Wharton, Ardhendu Behera, Yonghuai Liu, Nikolaos Bessis","doi":"10.1109/WACV48630.2021.00132","DOIUrl":"https://doi.org/10.1109/WACV48630.2021.00132","url":null,"abstract":"There is significant progress in recognizing traditional human activities from videos focusing on highly distinctive actions involving discriminative body movements, body-object and/or human-human interactions. Driver’s activities are different since they are executed by the same subject with similar body parts movements, resulting in subtle changes. To address this, we propose a novel framework by exploiting the spatiotemporal attention to model the subtle changes. Our model is named Coarse Temporal Attention Network (CTA-Net), in which coarse temporal branches are introduced in a trainable glimpse network. The goal is to allow the glimpse to capture high-level temporal relationships, such as ‘during’, ‘before’ and ‘after’ by focusing on a specific part of a video. These branches also respect the topology of the temporal dynamics in the video, ensuring that different branches learn meaningful spatial and temporal changes. The model then uses an innovative attention mechanism to generate high-level action specific contextual information for activity recognition by exploring the hidden states of an LSTM. The attention mechanism helps in learning to decide the importance of each hidden state for the recognition task by weighing them when constructing the representation of the video. Our approach is evaluated on four publicly accessible datasets and significantly outperforms the state-of-the-art by a considerable margin with only RGB video as input.","PeriodicalId":236300,"journal":{"name":"2021 IEEE Winter Conference on Applications of Computer Vision (WACV)","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128663748","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 20
Data-free Knowledge Distillation for Object Detection
2021 IEEE Winter Conference on Applications of Computer Vision (WACV) Pub Date : 2021-01-01 DOI: 10.1109/WACV48630.2021.00333
Akshay Chawla, Hongxu Yin, Pavlo Molchanov, J. Álvarez
{"title":"Data-free Knowledge Distillation for Object Detection","authors":"Akshay Chawla, Hongxu Yin, Pavlo Molchanov, J. Álvarez","doi":"10.1109/WACV48630.2021.00333","DOIUrl":"https://doi.org/10.1109/WACV48630.2021.00333","url":null,"abstract":"We present DeepInversion for Object Detection (DIODE) to enable data-free knowledge distillation for neural networks trained on the object detection task. From a data-free perspective, DIODE synthesizes images given only an off-the-shelf pre-trained detection network and without any prior domain knowledge, generator network, or pre-computed activations. DIODE relies on two key components—first, an extensive set of differentiable augmentations to improve image fidelity and distillation effectiveness. Second, a novel automated bounding box and category sampling scheme for image synthesis enabling generating a large number of images with a diverse set of spatial and category objects. The resulting images enable data-free knowledge distillation from a teacher to a student detector, initialized from scratch.In an extensive set of experiments, we demonstrate that DIODE’s ability to match the original training distribution consistently enables more effective knowledge distillation than out-of-distribution proxy datasets, which unavoidably occur in a data-free setup given the absence of the original domain knowledge.","PeriodicalId":236300,"journal":{"name":"2021 IEEE Winter Conference on Applications of Computer Vision (WACV)","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129456350","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 48
The Laughing Machine: Predicting Humor in Video
2021 IEEE Winter Conference on Applications of Computer Vision (WACV) Pub Date : 2021-01-01 DOI: 10.1109/WACV48630.2021.00212
Yuta Kayatani, Zekun Yang, Mayu Otani, Noa García, Chenhui Chu, Yuta Nakashima, H. Takemura
{"title":"The Laughing Machine: Predicting Humor in Video","authors":"Yuta Kayatani, Zekun Yang, Mayu Otani, Noa García, Chenhui Chu, Yuta Nakashima, H. Takemura","doi":"10.1109/WACV48630.2021.00212","DOIUrl":"https://doi.org/10.1109/WACV48630.2021.00212","url":null,"abstract":"Humor is a very important communication tool; yet, it is an open problem for machines to understand humor. In this paper, we build a new multimodal dataset for humor prediction that includes subtitles and video frames, as well as humor labels associated with video’s timestamps. On top of it, we present a model to predict whether a subtitle causes laughter. Our model uses the visual modality through facial expression and character name recognition, together with the verbal modality, to explore how the visual modality helps. In addition, we use an attention mechanism to adjust the weight for each modality to facilitate humor prediction. Interestingly, our experimental results show that the performance boost by combinations of different modalities, and the attention mechanism and the model mostly relies on the verbal modality.","PeriodicalId":236300,"journal":{"name":"2021 IEEE Winter Conference on Applications of Computer Vision (WACV)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130498713","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 8
FairFace: Face Attribute Dataset for Balanced Race, Gender, and Age for Bias Measurement and Mitigation
2021 IEEE Winter Conference on Applications of Computer Vision (WACV) Pub Date : 2021-01-01 DOI: 10.1109/WACV48630.2021.00159
Kimmo Kärkkäinen, Jungseock Joo
{"title":"FairFace: Face Attribute Dataset for Balanced Race, Gender, and Age for Bias Measurement and Mitigation","authors":"Kimmo Kärkkäinen, Jungseock Joo","doi":"10.1109/WACV48630.2021.00159","DOIUrl":"https://doi.org/10.1109/WACV48630.2021.00159","url":null,"abstract":"Existing public face image datasets are strongly biased toward Caucasian faces, and other races (e.g., Latino) are significantly underrepresented. The models trained from such datasets suffer from inconsistent classification accuracy, which limits the applicability of face analytic systems to non-White race groups. To mitigate the race bias problem in these datasets, we constructed a novel face image dataset containing 108,501 images which is balanced on race. We define 7 race groups: White, Black, Indian, East Asian, Southeast Asian, Middle Eastern, and Latino. Images were collected from the YFCC-100M Flickr dataset and labeled with race, gender, and age groups. Evaluations were performed on existing face attribute datasets as well as novel image datasets to measure the generalization performance. We find that the model trained from our dataset is substantially more accurate on novel datasets and the accuracy is consistent across race and gender groups. We also compare several commercial computer vision APIs and report their balanced accuracy across gender, race, and age groups. Our code, data, and models are available at https://github.com/joojs/fairface.","PeriodicalId":236300,"journal":{"name":"2021 IEEE Winter Conference on Applications of Computer Vision (WACV)","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123933686","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 359
Spatially Aware Metadata for Raw Reconstruction
2021 IEEE Winter Conference on Applications of Computer Vision (WACV) Pub Date : 2021-01-01 DOI: 10.1109/WACV48630.2021.00026
Abhijith Punnappurath, M. S. Brown
{"title":"Spatially Aware Metadata for Raw Reconstruction","authors":"Abhijith Punnappurath, M. S. Brown","doi":"10.1109/WACV48630.2021.00026","DOIUrl":"https://doi.org/10.1109/WACV48630.2021.00026","url":null,"abstract":"A camera sensor captures a raw-RGB image that is then processed to a standard RGB (sRGB) image through a series of onboard operations performed by the camera’s image signal processor (ISP). Among these processing steps, local tone mapping is one of the most important operations used to enhance the overall appearance of the final rendered sRGB image. For certain applications, it is often desirable to de-render or unprocess the sRGB image back to its original raw-RGB values. This \"raw reconstruction\" is a challenging task because many of the operations performed by the ISP, including local tone mapping, are nonlinear and difficult to invert. Existing raw reconstruction methods that store specialized metadata at capture time to enable raw recovery ignore local tone mapping and assume that a global transformation exists between the raw-RGB and sRGB color spaces. In this work, we advocate a spatially aware metadata-based raw reconstruction method that is robust to local tone mapping, and yields significantly higher raw reconstruction accuracy (6 dB average PSNR improvement) compared to existing raw reconstruction methods. Our method requires only 0.2% samples of the full-sized image as metadata, has negligible computational overhead at capture time, and can be easily integrated into modern ISPs.","PeriodicalId":236300,"journal":{"name":"2021 IEEE Winter Conference on Applications of Computer Vision (WACV)","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123944455","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 8
Interpretable and Trustworthy Deepfake Detection via Dynamic Prototypes
2021 IEEE Winter Conference on Applications of Computer Vision (WACV) Pub Date : 2021-01-01 DOI: 10.1109/WACV48630.2021.00202
Loc Trinh, Michael Tsang, Sirisha Rambhatla, Yan Liu
{"title":"Interpretable and Trustworthy Deepfake Detection via Dynamic Prototypes","authors":"Loc Trinh, Michael Tsang, Sirisha Rambhatla, Yan Liu","doi":"10.1109/WACV48630.2021.00202","DOIUrl":"https://doi.org/10.1109/WACV48630.2021.00202","url":null,"abstract":"In this paper we propose a novel human-centered approach for detecting forgery in face images, using dynamic prototypes as a form of visual explanations. Currently, most state-of-the-art deepfake detections are based on black-box models that process videos frame-by-frame for inference, and few closely examine their temporal inconsistencies. However, the existence of such temporal artifacts within deepfake videos is key in detecting and explaining deepfakes to a supervising human. To this end, we propose Dynamic Prototype Network (DPNet) – an interpretable and effective solution that utilizes dynamic representations (i.e., prototypes) to explain deepfake temporal artifacts. Extensive experimental results show that DPNet achieves competitive predictive performance, even on unseen testing datasets such as Google’s DeepFakeDetection, DeeperForensics, and Celeb-DF, while providing easy referential explanations of deepfake dynamics. On top of DPNet’s prototypical framework, we further formulate temporal logic specifications based on these dynamics to check our model’s compliance to desired temporal behaviors, hence providing trustworthiness for such critical detection systems.","PeriodicalId":236300,"journal":{"name":"2021 IEEE Winter Conference on Applications of Computer Vision (WACV)","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123727404","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 47