Title: A Deeper Analysis of Volumetric Relightable Faces
Authors: Pramod Rao, B. R. Mallikarjun, Gereon Fox, Tim Weyrich, Bernd Bickel, Hanspeter Pfister, Wojciech Matusik, Fangneng Zhan, Ayush Tewari, Christian Theobalt, Mohamed Elgharib
Journal: International Journal of Computer Vision | DOI: 10.1007/s11263-023-01899-3 | Published: 2023-10-31

Abstract: Portrait viewpoint and illumination editing is an important problem with several applications in VR/AR, movies, and photography. Comprehensive knowledge of geometry and illumination is critical for obtaining photorealistic results. Current methods are unable to explicitly model in 3D while handling both viewpoint and illumination editing from a single image. In this paper, we propose VoRF, a novel approach that can take even a single portrait image as input and relight human heads under novel illuminations that can be viewed from arbitrary viewpoints. VoRF represents a human head as a continuous volumetric field and learns a prior model of human heads using a coordinate-based MLP with individual latent spaces for identity and illumination. The prior model is learned in an auto-decoder manner over a diverse class of head shapes and appearances, allowing VoRF to generalize to novel test identities from a single input image. Additionally, VoRF has a reflectance MLP that uses the intermediate features of the prior model to render One-Light-at-A-Time (OLAT) images under novel views. We synthesize novel illuminations by combining these OLAT images with target environment maps. Qualitative and quantitative evaluations demonstrate the effectiveness of VoRF for relighting and novel view synthesis, even when applied to unseen subjects under uncontrolled illumination. This work is an extension of Rao et al. (VoRF: Volumetric Relightable Faces, 2022). We provide extensive evaluations and ablative studies of our model and also present an application in which any face can be relit using textual input.

Title: InstaFormer++: Multi-Domain Instance-Aware Image-to-Image Translation with Transformer
Authors: Soohyun Kim, Jongbeom Baek, Jihye Park, Eunjae Ha, Homin Jung, Taeyoung Lee, Seungryong Kim
Journal: International Journal of Computer Vision | DOI: 10.1007/s11263-023-01866-y | Published: 2023-10-31

Abstract: We present a novel Transformer-based network architecture for instance-aware image-to-image translation, dubbed InstaFormer, to effectively integrate global- and instance-level information. By treating content features extracted from an image as visual tokens, our model discovers a global consensus among content features by considering context information through the self-attention modules of Transformers. By augmenting such tokens with an instance-level feature extracted from the content feature with respect to bounding-box information, our framework is capable of learning an interaction between object instances and the global image, thus boosting instance-awareness. We replace layer normalization (LayerNorm) in standard Transformers with adaptive instance normalization (AdaIN) to enable multi-modal translation with style codes. In addition, to improve instance-awareness and translation quality at object regions, we present an instance-level content contrastive loss defined between the input and translated images. Although InstaFormer attains competitive performance, it faces some limitations, namely limited scalability in handling multiple domains and reliance on domain annotations. To overcome this, we propose InstaFormer++, an extension of InstaFormer that enables multi-domain instance-aware image translation for the first time. We obtain pseudo domain labels by leveraging a list of candidate domain labels in text form and a pretrained vision-language model. We conduct experiments to demonstrate the effectiveness of our methods over the latest methods and provide extensive ablation studies.
{"title":"SegViT v2: Exploring Efficient and Continual Semantic Segmentation with Plain Vision Transformers","authors":"Bowen Zhang, Liyang Liu, Minh Hieu Phan, Zhi Tian, Chunhua Shen, Yifan Liu","doi":"10.1007/s11263-023-01894-8","DOIUrl":"https://doi.org/10.1007/s11263-023-01894-8","url":null,"abstract":"<p>This paper investigates the capability of plain Vision Transformers (ViTs) for semantic segmentation using the encoder–decoder framework and introduce <i>SegViTv2</i>. In this study, we introduce a novel Attention-to-Mask (ATM) module to design a lightweight decoder effective for plain ViT. The proposed ATM converts the global attention map into semantic masks for high-quality segmentation results. Our decoder outperforms popular decoder UPerNet using various ViT backbones while consuming only about <span>(5%)</span> of the computational cost. For the encoder, we address the concern of the relatively high computational cost in the ViT-based encoders and propose a <i>Shrunk</i>++ structure that incorporates edge-aware query-based down-sampling (EQD) and query-based up-sampling (QU) modules. The Shrunk++ structure reduces the computational cost of the encoder by up to <span>(50%)</span> while maintaining competitive performance. Furthermore, we propose to adapt SegViT for continual semantic segmentation, demonstrating nearly zero forgetting of previously learned knowledge. Experiments show that our proposed SegViTv2 surpasses recent segmentation methods on three popular benchmarks including ADE20k, COCO-Stuff-10k and PASCAL-Context datasets. The code is available through the following link: https://github.com/zbwxp/SegVit.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"31 6","pages":""},"PeriodicalIF":19.5,"publicationDate":"2023-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"71492607","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Symmetry-aware Neural Architecture for Embodied Visual Navigation","authors":"Shuang Liu, Masanori Suganuma, Takayuki Okatani","doi":"10.1007/s11263-023-01909-4","DOIUrl":"https://doi.org/10.1007/s11263-023-01909-4","url":null,"abstract":"<p>The existing methods for addressing visual navigation employ deep reinforcement learning as the standard tool for the task. However, they tend to be vulnerable to statistical shifts between the training and test data, resulting in poor generalization over novel environments that are out-of-distribution from the training data. In this study, we attempt to improve the generalization ability by utilizing the inductive biases available for the task. Employing the active neural SLAM that learns policies with the advantage actor-critic method as the base framework, we first point out that the mappings represented by the actor and the critic should satisfy specific symmetries. We then propose a network design for the actor and the critic to inherently attain these symmetries. Specifically, we use <i>G</i>-convolution instead of the standard convolution and insert the semi-global polar pooling layer, which we newly design in this study, in the last section of the critic network. Our method can be integrated into existing methods that utilize intermediate goals and 2D occupancy maps. Experimental results show that our method improves generalization ability by a good margin over visual exploration and object goal navigation, which are two main embodied visual navigation tasks.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"31 51","pages":""},"PeriodicalIF":19.5,"publicationDate":"2023-10-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"71492521","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Title: FlowNAS: Neural Architecture Search for Optical Flow Estimation
Authors: Zhiwei Lin, Tingting Liang, Taihong Xiao, Yongtao Wang, Ming-Hsuan Yang
Journal: International Journal of Computer Vision | DOI: 10.1007/s11263-023-01920-9 | Published: 2023-10-25

Abstract: Recent optical flow estimators usually employ deep models designed for image classification as the encoders for feature extraction and matching. However, encoders developed for image classification may be sub-optimal for flow estimation. In contrast, the decoders of optical flow estimators are often meticulously designed for flow estimation. This disconnect between the encoder and decoder can negatively affect optical flow estimation. To address this issue, we propose a neural architecture search method, FlowNAS, to automatically find more suitable and stronger encoder architectures for existing flow decoders. We first design a suitable search space, including various convolutional operators, and construct a weight-sharing super-network for efficiently evaluating the candidate architectures. To better train the super-network, we present a Feature Alignment Distillation module that utilizes a well-trained flow estimator to guide the training of the super-network. Finally, a resource-constrained evolutionary algorithm is exploited to determine an optimal architecture (i.e., sub-network). Experimental results show that FlowNAS can be easily incorporated into existing flow estimators and achieves state-of-the-art performance with a good trade-off between accuracy and efficiency. Furthermore, the encoder architecture discovered by FlowNAS, with weights inherited from the super-network, achieves 4.67% F1-all error on KITTI, an 8.4% reduction from the RAFT baseline, surpassing the state-of-the-art handcrafted GMA and AGFlow models while reducing model complexity and latency. The source code and trained models will be released at https://github.com/VDIGPKU/FlowNAS.
{"title":"DIVOTrack: A Novel Dataset and Baseline Method for Cross-View Multi-Object Tracking in DIVerse Open Scenes","authors":"Shengyu Hao, Peiyuan Liu, Yibing Zhan, Kaixun Jin, Zuozhu Liu, Mingli Song, Jenq-Neng Hwang, Gaoang Wang","doi":"10.1007/s11263-023-01922-7","DOIUrl":"https://doi.org/10.1007/s11263-023-01922-7","url":null,"abstract":"<p>Cross-view multi-object tracking aims to link objects between frames and camera views with substantial overlaps. Although cross-view multi-object tracking has received increased attention in recent years, existing datasets still have several issues, including (1) missing real-world scenarios, (2) lacking diverse scenes, (3) containing a limited number of tracks, (4) comprising only static cameras, and (5) lacking standard benchmarks, which hinder the investigation and comparison of cross-view tracking methods. To solve the aforementioned issues, we introduce <i>DIVOTrack</i>: a new cross-view multi-object tracking dataset for DIVerse Open scenes with dense tracking pedestrians in realistic and non-experimental environments. Our DIVOTrack has fifteen distinct scenarios and 953 cross-view tracks, surpassing all cross-view multi-object tracking datasets currently available. Furthermore, we provide a novel baseline cross-view tracking method with a unified joint detection and cross-view tracking framework named <i>CrossMOT</i>, which learns object detection, single-view association, and cross-view matching with an <i>all-in-one</i> embedding model. Finally, we present a summary of current methodologies and a set of standard benchmarks with our DIVOTrack to provide a fair comparison and conduct a comprehensive analysis of current approaches and our proposed CrossMOT. The dataset and code are available at https://github.com/shengyuhao/DIVOTrack.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"31 12","pages":""},"PeriodicalIF":19.5,"publicationDate":"2023-10-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"71492602","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Language-Aware Soft Prompting: Text-to-Text Optimization for Few- and Zero-Shot Adaptation of V &L Models","authors":"Adrian Bulat, Georgios Tzimiropoulos","doi":"10.1007/s11263-023-01904-9","DOIUrl":"https://doi.org/10.1007/s11263-023-01904-9","url":null,"abstract":"<p>Soft prompt learning has emerged as a promising direction for adapting V &L models to a downstream task using a few training examples. However, current methods significantly overfit the training data suffering from large accuracy degradation when tested on unseen classes from the same domain. In addition, all prior methods operate exclusively under the assumption that both vision and language data is present. To this end, we make the following 5 contributions: (1) To alleviate base class overfitting, we propose a novel Language-Aware Soft Prompting (LASP) learning method by means of a text-to-text cross-entropy loss that maximizes the probability of the learned prompts to be correctly classified with respect to pre-defined hand-crafted textual prompts. (2) To increase the representation capacity of the prompts, we also propose <i>grouped</i> LASP where each group of prompts is optimized with respect to a separate subset of textual prompts. (3) Moreover, we identify a visual-language misalignment introduced by prompt learning and LASP, and more importantly, propose a re-calibration mechanism to address it. (4) Importantly, we show that LASP is inherently amenable to including, during training, <i>virtual classes</i>, i.e. class names for which no visual samples are available, further increasing the robustness of the learned prompts. Expanding for the first time the setting to language-only adaptation, (5) we present a novel zero-shot variant of LASP where no visual samples at all are available for the downstream task. Through evaluations on 11 datasets, we show that our approach (a) significantly outperforms all prior works on soft prompting, and (b) matches and surpasses, for the first time, the accuracy on novel classes obtained by hand-crafted prompts and CLIP for 8 out of 11 test datasets. Finally, (c) we show that our zero-shot variant improves upon CLIP without requiring any extra data. Code will be made available.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"31 47","pages":""},"PeriodicalIF":19.5,"publicationDate":"2023-10-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"71492663","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards a Unified Network for Robust Monocular Depth Estimation: Network Architecture, Training Strategy and Dataset","authors":"Mochu Xiang, Yuchao Dai, Feiyu Zhang, Jiawei Shi, Xinyu Tian, Zhensong Zhang","doi":"10.1007/s11263-023-01915-6","DOIUrl":"https://doi.org/10.1007/s11263-023-01915-6","url":null,"abstract":"<p>Robust monocular depth estimation (MDE) aims at learning a <i>unified</i> model that works across diverse real-world scenes, which is an important and active topic in computer vision. In this paper, we present <span>Megatron_RVC</span>, our winning solution for the monocular depth challenge in the Robust Vision Challenge (RVC) 2022, where we tackle the challenging problem from three perspectives: network architecture, training strategy and dataset. In particular, we made three contributions towards robust MDE: (1) we built a neural network with high capacity to enable flexible and accurate monocular depth predictions, which contains dedicated components to provide content-aware embeddings and to improve the richness of the details; (2) we proposed a novel mixing training strategy to handle real-world images with different aspect ratios, resolutions and apply tailored loss functions based on the properties of their depth maps; (3) to train a unified network model that covers diverse real-world scenes, we used over 1 million images from different datasets. As of 3rd October 2022, our unified model ranked consistently first across three benchmarks (KITTI, MPI Sintel, and VIPER) among all participants.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"31 48","pages":""},"PeriodicalIF":19.5,"publicationDate":"2023-10-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"71492520","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Title: Indoor Obstacle Discovery on Reflective Ground via Monocular Camera
Authors: Feng Xue, Yicong Chang, Tianxi Wang, Yu Zhou, Anlong Ming
Journal: International Journal of Computer Vision | DOI: 10.1007/s11263-023-01925-4 | Published: 2023-10-20

Abstract: Visual obstacle discovery is a key step towards autonomous navigation for indoor mobile robots, and successful solutions apply to many kinds of scenes. One exception is reflective ground: reflections on the floor resemble the real world, which confuses obstacle discovery and causes navigation to fail. We argue that the key to this problem lies in obtaining discriminative features for reflections and obstacles. Note that obstacles and reflections can be separated by the ground plane in 3D space. With this observation, we first introduce a pre-calibration-based ground detection scheme that uses robot motion to predict the ground plane. Because robot motion is immune to reflections, this scheme avoids the ground-detection failures that reflections cause. Given the detected ground, we design a ground-pixel parallax to describe the location of a pixel relative to the ground. Based on this, a unified appearance-geometry feature representation is proposed to describe objects inside rectangular boxes. Finally, within a segmenting-by-detection framework, an appearance-geometry fusion regressor is designed that utilizes the proposed feature to discover obstacles; it also prevents our model from concentrating too much on parts of obstacles instead of whole obstacles. For evaluation, we introduce a new dataset for Obstacle on Reflective Ground (ORG), which comprises 15 scenes with various ground reflections, totaling more than 200 image sequences and 3,400 RGB images. Pixel-wise annotations of ground and obstacles enable comparison of our method with others. By reducing misdetections caused by reflections, the proposed approach outperforms the others. The source code and the dataset will be available at https://github.com/xuefeng-cvr/IndoorObstacleDiscovery-RG.

Title: Style-Hallucinated Dual Consistency Learning: A Unified Framework for Visual Domain Generalization
Authors: Yuyang Zhao, Zhun Zhong, Na Zhao, Nicu Sebe, Gim Hee Lee
Journal: International Journal of Computer Vision | DOI: 10.1007/s11263-023-01911-w | Published: 2023-10-18

Abstract: Domain shift widely exists in the visual world, and modern deep neural networks commonly suffer severe performance degradation under domain shift due to poor generalization ability, which limits real-world applications. The domain shift mainly lies in the limited source environmental variations and the large distribution gap between the source and unseen target data. To this end, we propose a unified framework, Style-HAllucinated Dual consistEncy learning (SHADE), to handle such domain shift in various visual tasks. Specifically, SHADE is built on two consistency constraints, Style Consistency (SC) and Retrospection Consistency (RC). SC enriches the source situations and encourages the model to learn consistent representations across style-diversified samples. RC leverages general visual knowledge to prevent the model from overfitting to the source data and thus largely keeps the representations consistent between the source and general visual models. Furthermore, we present a novel style hallucination module (SHM) to generate the style-diversified samples that are essential for consistency learning. SHM selects basis styles from the source distribution, enabling the model to dynamically generate diverse and realistic samples during training. Extensive experiments demonstrate that our versatile SHADE can significantly enhance generalization in various visual recognition tasks, including image classification, semantic segmentation, and object detection, with different models, i.e., ConvNets and Transformers.