{"title":"Learning Graph Variational Autoencoders with Constraints and Structured Priors for Conditional Indoor 3D Scene Generation","authors":"Aditya Chattopadhyay, Xi Zhang, D. Wipf, H. Arora, Rene Vidal","doi":"10.1109/WACV56688.2023.00085","DOIUrl":"https://doi.org/10.1109/WACV56688.2023.00085","url":null,"abstract":"We present a graph variational autoencoder with a structured prior for generating the layout of indoor 3D scenes. Given the room type (e.g., living room or library) and the room layout (e.g., room elements such as floor and walls), our architecture generates a collection of objects (e.g., furniture items such as sofa, table and chairs) that is consistent with the room type and layout. This is a challenging problem because the generated scene needs to satisfy multiple constrains, e.g., each object should lie inside the room and two objects should not occupy the same volume. To address these challenges, we propose a deep generative model that encodes these relationships as soft constraints on an attributed graph (e.g., the nodes capture attributes of room and furniture elements, such as shape, class, pose and size, and the edges capture geometric relationships such as relative orientation). The architecture consists of a graph encoder that maps the input graph to a structured latent space, and a graph decoder that generates a furniture graph, given a latent code and the room graph. The latent space is modeled with autoregressive priors, which facilitates the generation of highly structured scenes. We also propose an efficient training procedure that combines matching and constrained learning. Experiments on the 3D-FRONT dataset show that our method produces scenes that are diverse and are adapted to the room layout.","PeriodicalId":270631,"journal":{"name":"2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)","volume":"228 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114249831","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CNN2Graph: Building Graphs for Image Classification","authors":"Vivek Trivedy, Longin Jan Latecki","doi":"10.1109/WACV56688.2023.00009","DOIUrl":"https://doi.org/10.1109/WACV56688.2023.00009","url":null,"abstract":"Neural Network classifiers generally operate via the i.i.d. assumption where examples are passed through independently during training. We propose CNN2GNN and CNN2Transformer which instead leverage inter-example information for classification. We use Graph Neural Networks (GNNs) to build a latent space bipartite graph and compute cross-attention scores between input images and a proxy set. Our approach addresses several challenges of existing methods. Firstly, it is end-to-end differentiable despite the generally discrete nature of graph construction. Secondly, it allows inductive inference at no extra cost. Thirdly, it presents a simple method to construct graphs from arbitrary datasets that captures both example level and class level information. Finally, it addresses the proxy collapse problem by combining contrastive and cross-entropy losses rather than separate clustering algorithms. Our results increase classification performance over baseline experiments and outperform other methods. We also conduct an empirical investigation showing that Transformer style attention scales better than GAT attention with dataset size.","PeriodicalId":270631,"journal":{"name":"2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)","volume":"122 6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129503862","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GEMS: Generating Efficient Meta-Subnets","authors":"Varad Pimpalkhute, Shruti Kunde, Rekha Singhal","doi":"10.1109/WACV56688.2023.00528","DOIUrl":"https://doi.org/10.1109/WACV56688.2023.00528","url":null,"abstract":"Gradient-based meta learners (GBML) such as MAML [6] aim to learn a model initialization across similar tasks, such that the model generalizes well on unseen tasks sampled from the same distribution with few gradient updates. A limitation of GBML is its inability to adapt to real-world applications where input tasks are sampled from multiple distributions. An existing effort [23] learns ${mathcal{N}}$ initializations for tasks sampled from ${mathcal{N}}$ distributions; roughly increasing training time by a factor of ${mathcal{N}}$. Instead, we use a single model initialization to learn distribution-specific parameters for every input task. This reduces negative knowledge transfer across distributions and overall computational cost. Specifically, we explore two ways of efficiently learning on multi-distribution tasks: 1) Binary Mask Perceptron (BMP) which learns distribution-specific layers, 2) Multi-modal Supermask (MMSUP) which learns distribution-specific parameters. We evaluate the performance of the proposed framework (GEMS) on few-shot vision classification tasks. The experimental results demonstrate an improvement in accuracy and a speed-up of ~2× to 4× in the training time, over existing state of the art algorithms on quasi-benchmark datasets in the field of meta-learning.","PeriodicalId":270631,"journal":{"name":"2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)","volume":"59 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129718358","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient Flow-Guided Multi-frame De-fencing","authors":"Stavros Tsogkas, Feng Zhang, A. Jepson, Alex Levinshtein","doi":"10.1109/WACV56688.2023.00188","DOIUrl":"https://doi.org/10.1109/WACV56688.2023.00188","url":null,"abstract":"Taking photographs \"in-the-wild\" is often hindered by fence obstructions that stand between the camera user and the scene of interest, and which are hard or impossible to avoid. De-fencing is the algorithmic process of automatically removing such obstructions from images, revealing the invisible parts of the scene. While this problem can be formulated as a combination of fence segmentation and image inpainting, this often leads to implausible hallucinations of the occluded regions. Existing multi-frame approaches rely on propagating information to a selected keyframe from its temporal neighbors, but they are often inefficient and struggle with alignment of severely obstructed images. In this work we draw inspiration from the video completion literature, and develop a simplified framework for multi-frame de-fencing that computes high quality flow maps directly from obstructed frames, and uses them to accurately align frames. Our primary focus is efficiency and practicality in a real world setting: the input to our algorithm is a short image burst (5 frames) – a data modality commonly available in modern smartphones– and the output is a single reconstructed keyframe, with the fence removed. Our approach leverages simple yet effective CNN modules, trained on carefully generated synthetic data, and outperforms more complicated alternatives real bursts, both quantitatively and qualitatively, while running real-time.","PeriodicalId":270631,"journal":{"name":"2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129951892","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Proactive Deepfake Defence via Identity Watermarking","authors":"Yuan Zhao, Bo Liu, Ming Ding, Baoping Liu, Tianqing Zhu, Xin Yu","doi":"10.1109/WACV56688.2023.00458","DOIUrl":"https://doi.org/10.1109/WACV56688.2023.00458","url":null,"abstract":"The explosive progress of Deepfake techniques poses unprecedented privacy and security risks to our society by creating real-looking but fake visual content. The current Deepfake detection studies are still in their infancy because they mainly rely on capturing artifacts left by a Deepfake synthesis process as detection clues, which can be easily removed by various distortions (e.g. blurring) or advanced Deepfake techniques. In this paper, we propose a novel method that does not depend on identifying the artifacts but resorts to the mechanism of anti-counterfeit labels to protect face images from malicious Deepfake tampering. Specifically, we design a neural network with an encoder-decoder structure to embed watermarks as anti-Deepfake labels into the facial identity features. The injected label is entangled with the facial identity feature, so it will be sensitive to face swap translations (i.e., Deepfake) and robust to conventional image modifications (e.g., resize and compress). Therefore, we can identify whether watermarked images have been tampered with by Deepfake methods according to the label’s existence. Experimental results demonstrate that our method can achieve average detection accuracy of more than 80%, which validates the proposed method’s effectiveness in implementing Deepfake detection.","PeriodicalId":270631,"journal":{"name":"2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128325617","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MFCFlow: A Motion Feature Compensated Multi-Frame Recurrent Network for Optical Flow Estimation","authors":"Yonghu Chen, Dongchen Zhu, Wenjun Shi, Guanghui Zhang, Tianyu Zhang, Xiaolin Zhang, Jiamao Li","doi":"10.1109/WACV56688.2023.00504","DOIUrl":"https://doi.org/10.1109/WACV56688.2023.00504","url":null,"abstract":"Occlusions have long been a hard nut to crack in optical flow estimation due to ambiguous pixels matching between abutting images. Current methods only take two consecutive images as input, which is challenging to capture temporal coherence and reason about occluded regions. In this paper, we propose a novel optical flow estimation framework, namely MFCFlow, which attempts to compensate for the information of occlusions by mining and transferring motion features between multiple frames. Specifically, we construct a Motion-guided Feature Compensation cell (MFC cell) to enhance the ambiguous motion features according to the correlation of previous features obtained by attention-based structure. Furthermore, a TopK attention strategy is developed and embedded into the MFC cell to improve the subsequent matching quality. Extensive experiments demonstrate that our MFCFlow achieves significant improvements in occluded regions and attains state-of-the-art performances on both Sintel and KITTI benchmarks among other multi-frame optical flow methods.","PeriodicalId":270631,"journal":{"name":"2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129256681","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CameraPose: Weakly-Supervised Monocular 3D Human Pose Estimation by Leveraging In-the-wild 2D Annotations","authors":"Cheng-Yen Yang, Jiajia Luo, Lu Xia, Yuyin Sun, Nan Qiao, Ke Zhang, Zhongyu Jiang, Jenq-Neng Hwang","doi":"10.1109/WACV56688.2023.00294","DOIUrl":"https://doi.org/10.1109/WACV56688.2023.00294","url":null,"abstract":"To improve the generalization of 3D human pose estimators, many existing deep learning based models focus on adding different augmentations to training poses. However, data augmentation techniques are limited to the \"seen\" pose combinations and hard to infer poses with rare \"unseen\" joint positions. To address this problem, we present CameraPose, a weakly-supervised framework for 3D human pose estimation from a single image, which can not only be applied on 2D-3D pose pairs but also on 2D alone annotations. By adding a camera parameter branch, any in-the-wild 2D annotations can be fed into our pipeline to boost the training diversity and the 3D poses can be implicitly learned by reprojecting back to 2D. Moreover, CameraPose introduces a refinement network module with confidence-guided loss to further improve the quality of noisy 2D keypoints extracted by 2D pose estimators. Experimental results demonstrate that the CameraPose brings in clear improvements on cross-scenario datasets. Notably, it outperforms the baseline method by 3mm on the most challenging dataset 3DPW. In addition, by combining our proposed refinement network module with existing 3D pose estimators, their performance can be improved in cross-scenario evaluation.","PeriodicalId":270631,"journal":{"name":"2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130384429","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Avoiding Lingering in Learning Active Recognition by Adversarial Disturbance","authors":"Lei Fan, Ying Wu","doi":"10.1109/WACV56688.2023.00459","DOIUrl":"https://doi.org/10.1109/WACV56688.2023.00459","url":null,"abstract":"This paper considers the active recognition scenario, where the agent is empowered to intelligently acquire observations for better recognition. The agents usually compose two modules, i.e., the policy and the recognizer, to select actions and predict the category. While using ground-truth class labels to supervise the recognizer, the policy is typically updated with rewards determined by the current in-training recognizer, like whether achieving correct predictions. However, this joint learning process could lead to unintended solutions, like a collapsed policy that only visits views that the recognizer is already sufficiently trained to obtain rewards, which harms the generalization ability. We call this phenomenon lingering to depict the agent being reluctant to explore challenging views during training. Existing approaches to tackle the exploration-exploitation trade-off could be ineffective as they usually assume reliable feedback during exploration to update the estimate of rarely-visited states. This assumption is invalid here as the reward from the recognizer could be insufficiently trained.To this end, our approach integrates another adversarial policy to constantly disturb the recognition agent during training, forming a competing game to promote active explorations and avoid lingering. The reinforced adversary, rewarded when the recognition fails, contests the recognition agent by turning the camera to challenging observations. Extensive experiments across two datasets validate the effectiveness of the proposed approach regarding its recognition performances, learning efficiencies, and especially robustness in managing environmental noises.","PeriodicalId":270631,"journal":{"name":"2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)","volume":"22 4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123495180","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ReEnFP: Detail-Preserving Face Reconstruction by Encoding Facial Priors","authors":"Yasheng Sun, Jiangke Lin, Hang Zhou, Zhi-liang Xu, Dongliang He, H. Koike","doi":"10.1109/WACV56688.2023.00606","DOIUrl":"https://doi.org/10.1109/WACV56688.2023.00606","url":null,"abstract":"We address the problem of face modeling, which is still challenging in achieving high-quality reconstruction results efficiently. Neither previous regression-based nor optimization-based frameworks could well balance between the facial reconstruction fidelity and efficiency. We notice that the large amount of in-the-wild facial images contain diverse appearance information, however, their underlying knowledge is not fully exploited for face modeling. To this end, we propose our Reconstruction by Encoding Facial Priors (ReEnFP) pipeline to exploit the potential of unconstrained facial images for further improvement. Our key is to encode generative priors learned by a style-based texture generator on unconstrained data for fast and detail-preserving face reconstruction. With our texture generator pre-trained using a differentiable renderer, faces could be encoded to its latent space as opposed to the time-consuming optimization-based inversion. Our generative prior encoding is further enhanced with a pyramid fusion block for adaptive integration of input spatial information. Extensive experiments show that our method reconstructs photo-realistic facial textures and geometric details with precise identity recovery.","PeriodicalId":270631,"journal":{"name":"2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123640638","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Split to Learn: Gradient Split for Multi-Task Human Image Analysis","authors":"Weijian Deng, Yumin Suh, Xiang Yu, M. Faraki, Liang Zheng, Manmohan Chandraker","doi":"10.1109/WACV56688.2023.00433","DOIUrl":"https://doi.org/10.1109/WACV56688.2023.00433","url":null,"abstract":"This paper presents an approach to train a unified deep network that simultaneously solves multiple human-related tasks. A multi-task framework is favorable for sharing information across tasks under restricted computational resources. However, tasks not only share information but may also compete for resources and conflict with each other, making the optimization of shared parameters difficult and leading to suboptimal performance. We propose a simple but effective training scheme called GradSplit that alleviates this issue by utilizing asymmetric inter-task relations. Specifically, at each convolution module, it splits features into T groups for T tasks and trains each group only using the gradient back-propagated from the task losses with which it does not have conflicts. During training, we apply GradSplit to a series of convolution modules. As a result, each module is trained to generate a set of task-specific features using the shared features from the previous module. This enables a network to use complementary information across tasks while circumventing gradient conflicts. Experimental results show that GradSplit achieves a better accuracy-efficiency trade-off than existing methods. It minimizes accuracy drop caused by task conflicts while significantly saving compute resources in terms of both FLOPs and memory at inference. We further show that GradSplit achieves higher cross-dataset accuracy compared to single-task and other multi-task networks.","PeriodicalId":270631,"journal":{"name":"2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121700223","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}