International Journal of Computer Vision: Latest Publications

A Deeper Analysis of Volumetric Relightable Faces
IF 19.5, CAS Q2, Computer Science
International Journal of Computer Vision Pub Date : 2023-10-31 DOI: 10.1007/s11263-023-01899-3
Pramod Rao, B. R. Mallikarjun, Gereon Fox, Tim Weyrich, Bernd Bickel, Hanspeter Pfister, Wojciech Matusik, Fangneng Zhan, Ayush Tewari, Christian Theobalt, Mohamed Elgharib
{"title":"A Deeper Analysis of Volumetric Relightiable Faces","authors":"Pramod Rao, B. R. Mallikarjun, Gereon Fox, Tim Weyrich, Bernd Bickel, Hanspeter Pfister, Wojciech Matusik, Fangneng Zhan, Ayush Tewari, Christian Theobalt, Mohamed Elgharib","doi":"10.1007/s11263-023-01899-3","DOIUrl":"https://doi.org/10.1007/s11263-023-01899-3","url":null,"abstract":"<p>Portrait viewpoint and illumination editing is an important problem with several applications in VR/AR, movies, and photography. Comprehensive knowledge of geometry and illumination is critical for obtaining photorealistic results. Current methods are unable to explicitly model in 3<i>D</i> while handling both viewpoint and illumination editing from a single image. In this paper, we propose VoRF, a novel approach that can take even a single portrait image as input and relight human heads under novel illuminations that can be viewed from arbitrary viewpoints. VoRF represents a human head as a continuous volumetric field and learns a prior model of human heads using a coordinate-based MLP with individual latent spaces for identity and illumination. The prior model is learned in an auto-decoder manner over a diverse class of head shapes and appearances, allowing VoRF to generalize to novel test identities from a single input image. Additionally, VoRF has a reflectance MLP that uses the intermediate features of the prior model for rendering One-Light-at-A-Time (OLAT) images under novel views. We synthesize novel illuminations by combining these OLAT images with target environment maps. Qualitative and quantitative evaluations demonstrate the effectiveness of VoRF for relighting and novel view synthesis, even when applied to unseen subjects under uncontrolled illumination. This work is an extension of Rao et al. (VoRF: Volumetric Relightable Faces 2022). We provide extensive evaluation and ablative studies of our model and also provide an application, where any face can be relighted using textual input.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"31 36","pages":""},"PeriodicalIF":19.5,"publicationDate":"2023-10-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"71492666","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
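The relighting step described above relies on the linearity of light transport: an image under an environment map is a weighted sum of OLAT renders. The minimal sketch below illustrates only that combination step; the equirectangular sampling, array shapes, and normalization are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def relight_with_env_map(olat_images, light_dirs, env_map):
    """Combine One-Light-at-A-Time (OLAT) renders into a relit image.

    By linearity of light transport, an image under an environment map is a
    weighted sum of OLAT images, where each weight is the environment map's
    radiance along the corresponding light direction.

    olat_images : (L, H, W, 3) array, one render per light.
    light_dirs  : (L, 3) array of unit light directions (x, y, z).
    env_map     : (He, We, 3) equirectangular environment map.
    """
    He, We, _ = env_map.shape
    # Convert each light direction to equirectangular (lat/long) pixel coordinates.
    theta = np.arccos(np.clip(light_dirs[:, 1], -1.0, 1.0))   # polar angle from +y (assumed convention)
    phi = np.arctan2(light_dirs[:, 0], light_dirs[:, 2])      # azimuth
    rows = np.clip((theta / np.pi * He).astype(int), 0, He - 1)
    cols = np.clip(((phi + np.pi) / (2 * np.pi) * We).astype(int), 0, We - 1)
    weights = env_map[rows, cols]                              # (L, 3) radiance per light
    # Weighted sum over lights, per colour channel.
    relit = np.einsum('lhwc,lc->hwc', olat_images, weights)
    # Crude normalization standing in for per-light solid angles.
    return relit / max(len(light_dirs), 1)

# Toy usage: 4 lights, a 2x2 image, a small environment map.
olat = np.random.rand(4, 2, 2, 3)
dirs = np.array([[0, 1, 0], [0, -1, 0], [1, 0, 0], [0, 0, 1]], float)
env = np.ones((8, 16, 3))
print(relight_with_env_map(olat, dirs, env).shape)  # (2, 2, 3)
```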
InstaFormer++: Multi-Domain Instance-Aware Image-to-Image Translation with Transformer
IF 19.5, CAS Q2, Computer Science
International Journal of Computer Vision Pub Date : 2023-10-31 DOI: 10.1007/s11263-023-01866-y
Soohyun Kim, Jongbeom Baek, Jihye Park, Eunjae Ha, Homin Jung, Taeyoung Lee, Seungryong Kim
{"title":"InstaFormer++: Multi-Domain Instance-Aware Image-to-Image Translation with Transformer","authors":"Soohyun Kim, Jongbeom Baek, Jihye Park, Eunjae Ha, Homin Jung, Taeyoung Lee, Seungryong Kim","doi":"10.1007/s11263-023-01866-y","DOIUrl":"https://doi.org/10.1007/s11263-023-01866-y","url":null,"abstract":"<p>We present a novel Transformer-based network architecture for instance-aware image-to-image translation, dubbed InstaFormer, to effectively integrate global- and instance-level information. By considering extracted content features from an image as visual tokens, our model discovers global consensus of content features by considering context information through self-attention module of Transformers. By augmenting such tokens with an instance-level feature extracted from the content feature with respect to bounding box information, our framework is capable of learning an interaction between object instances and the global image, thus boosting the instance-awareness. We replace layer normalization (LayerNorm) in standard Transformers with adaptive instance normalization (AdaIN) to enable a multi-modal translation with style codes. In addition, to improve the instance-awareness and translation quality at object regions, we present an instance-level content contrastive loss defined between input and translated image. Although competitive performance can be attained by InstaFormer, it may face some limitations, i.e., limited scalability in handling multiple domains, and reliance on domain annotations. To overcome this, we propose InstaFormer++ as an extension of Instaformer, which enables multi-domain translation in instance-aware image translation for the first time. We propose to obtain pseudo domain label by leveraging a list of candidate domain labels in a text format and pretrained vision-language model. We conduct experiments to demonstrate the effectiveness of our methods over the latest methods and provide extensive ablation studies.\u0000</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"31 20","pages":""},"PeriodicalIF":19.5,"publicationDate":"2023-10-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"71492628","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
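The architectural change highlighted in the abstract, replacing LayerNorm with AdaIN so that a style code modulates the normalized tokens, can be illustrated with a minimal PyTorch module. The token layout and the single linear style projection below are assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn

class AdaIN1d(nn.Module):
    """Adaptive instance normalization for token sequences (B, N, C).

    Tokens are instance-normalized per channel, then scaled and shifted by
    gamma/beta predicted from a style code, so one network can produce
    multi-modal outputs controlled by the style.
    """
    def __init__(self, dim, style_dim):
        super().__init__()
        self.norm = nn.InstanceNorm1d(dim, affine=False)
        self.to_gamma_beta = nn.Linear(style_dim, 2 * dim)

    def forward(self, tokens, style):
        # tokens: (B, N, C); style: (B, style_dim)
        gamma, beta = self.to_gamma_beta(style).chunk(2, dim=-1)   # (B, C) each
        x = self.norm(tokens.transpose(1, 2)).transpose(1, 2)      # normalize over the token axis
        return gamma.unsqueeze(1) * x + beta.unsqueeze(1)

# Toy usage: 2 images, 16 tokens of width 64, an 8-dim style code.
tokens = torch.randn(2, 16, 64)
style = torch.randn(2, 8)
print(AdaIN1d(64, 8)(tokens, style).shape)  # torch.Size([2, 16, 64])
```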
SegViT v2: Exploring Efficient and Continual Semantic Segmentation with Plain Vision Transformers
IF 19.5, CAS Q2, Computer Science
International Journal of Computer Vision Pub Date : 2023-10-27 DOI: 10.1007/s11263-023-01894-8
Bowen Zhang, Liyang Liu, Minh Hieu Phan, Zhi Tian, Chunhua Shen, Yifan Liu
{"title":"SegViT v2: Exploring Efficient and Continual Semantic Segmentation with Plain Vision Transformers","authors":"Bowen Zhang, Liyang Liu, Minh Hieu Phan, Zhi Tian, Chunhua Shen, Yifan Liu","doi":"10.1007/s11263-023-01894-8","DOIUrl":"https://doi.org/10.1007/s11263-023-01894-8","url":null,"abstract":"<p>This paper investigates the capability of plain Vision Transformers (ViTs) for semantic segmentation using the encoder–decoder framework and introduce <i>SegViTv2</i>. In this study, we introduce a novel Attention-to-Mask (ATM) module to design a lightweight decoder effective for plain ViT. The proposed ATM converts the global attention map into semantic masks for high-quality segmentation results. Our decoder outperforms popular decoder UPerNet using various ViT backbones while consuming only about <span>(5%)</span> of the computational cost. For the encoder, we address the concern of the relatively high computational cost in the ViT-based encoders and propose a <i>Shrunk</i>++ structure that incorporates edge-aware query-based down-sampling (EQD) and query-based up-sampling (QU) modules. The Shrunk++ structure reduces the computational cost of the encoder by up to <span>(50%)</span> while maintaining competitive performance. Furthermore, we propose to adapt SegViT for continual semantic segmentation, demonstrating nearly zero forgetting of previously learned knowledge. Experiments show that our proposed SegViTv2 surpasses recent segmentation methods on three popular benchmarks including ADE20k, COCO-Stuff-10k and PASCAL-Context datasets. The code is available through the following link: https://github.com/zbwxp/SegVit.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"31 6","pages":""},"PeriodicalIF":19.5,"publicationDate":"2023-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"71492607","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
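A minimal, single-layer reading of an attention-to-mask style head is sketched below: class queries attend to patch tokens, and the query-key similarity map itself is read out as per-class masks. The real ATM is a full transformer decoder module; the dimensions and the extra "no object" bin here are assumptions.

```python
import torch
import torch.nn as nn

class AttentionToMask(nn.Module):
    """Minimal attention-to-mask head: class queries attend to patch tokens,
    and the similarity map (after a sigmoid) is read out as per-class masks."""
    def __init__(self, dim, num_classes):
        super().__init__()
        self.class_queries = nn.Parameter(torch.randn(num_classes, dim))
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.classifier = nn.Linear(dim, num_classes + 1)  # +1 for an assumed "no object" bin

    def forward(self, patch_tokens, hw):
        # patch_tokens: (B, N, C) from a plain ViT; hw = (H, W) with H * W == N.
        B, N, C = patch_tokens.shape
        q = self.to_q(self.class_queries).unsqueeze(0).expand(B, -1, -1)  # (B, K, C)
        k = self.to_k(patch_tokens)                                       # (B, N, C)
        sim = q @ k.transpose(1, 2) / C ** 0.5                            # (B, K, N)
        masks = torch.sigmoid(sim).reshape(B, -1, *hw)                    # (B, K, H, W)
        # Class scores come from attention-updated queries (softmax-weighted pooling).
        updated = torch.softmax(sim, dim=-1) @ patch_tokens               # (B, K, C)
        logits = self.classifier(updated)                                 # (B, K, K+1)
        return masks, logits

# Toy usage: an 8x8 patch grid, 256-d tokens, 21 classes.
tokens = torch.randn(2, 64, 256)
masks, logits = AttentionToMask(256, 21)(tokens, (8, 8))
print(masks.shape, logits.shape)  # (2, 21, 8, 8) (2, 21, 22)
```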
Symmetry-aware Neural Architecture for Embodied Visual Navigation
IF 19.5, CAS Q2, Computer Science
International Journal of Computer Vision Pub Date : 2023-10-25 DOI: 10.1007/s11263-023-01909-4
Shuang Liu, Masanori Suganuma, Takayuki Okatani
{"title":"Symmetry-aware Neural Architecture for Embodied Visual Navigation","authors":"Shuang Liu, Masanori Suganuma, Takayuki Okatani","doi":"10.1007/s11263-023-01909-4","DOIUrl":"https://doi.org/10.1007/s11263-023-01909-4","url":null,"abstract":"<p>The existing methods for addressing visual navigation employ deep reinforcement learning as the standard tool for the task. However, they tend to be vulnerable to statistical shifts between the training and test data, resulting in poor generalization over novel environments that are out-of-distribution from the training data. In this study, we attempt to improve the generalization ability by utilizing the inductive biases available for the task. Employing the active neural SLAM that learns policies with the advantage actor-critic method as the base framework, we first point out that the mappings represented by the actor and the critic should satisfy specific symmetries. We then propose a network design for the actor and the critic to inherently attain these symmetries. Specifically, we use <i>G</i>-convolution instead of the standard convolution and insert the semi-global polar pooling layer, which we newly design in this study, in the last section of the critic network. Our method can be integrated into existing methods that utilize intermediate goals and 2D occupancy maps. Experimental results show that our method improves generalization ability by a good margin over visual exploration and object goal navigation, which are two main embodied visual navigation tasks.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"31 51","pages":""},"PeriodicalIF":19.5,"publicationDate":"2023-10-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"71492521","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
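The symmetry the abstract appeals to can be made concrete with a p4 (four planar rotations) lifting group convolution: applying the same kernel at four orientations makes a 90-degree rotation of the input permute orientation channels and rotate feature maps, rather than change the features arbitrarily. The sketch below shows only this basic G-convolution idea, not the authors' actor-critic network or their semi-global polar pooling layer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class P4LiftingConv(nn.Module):
    """Minimal 'lifting' group convolution for the p4 group (4 planar rotations).

    The same kernel is applied at 0/90/180/270 degrees, producing one
    orientation channel per rotation, so rotating the input permutes the
    orientation channels instead of scrambling the features."""
    def __init__(self, in_ch, out_ch, kernel_size):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, kernel_size, kernel_size) * 0.1)

    def forward(self, x):
        # x: (B, C_in, H, W) -> (B, C_out, 4, H', W')
        outs = []
        for k in range(4):
            w = torch.rot90(self.weight, k, dims=[2, 3])  # rotate the filter by k * 90 degrees
            outs.append(F.conv2d(x, w))
        return torch.stack(outs, dim=2)

# Equivariance check: rotating the input by 90 degrees rotates the output maps
# and cyclically shifts the orientation axis.
conv = P4LiftingConv(3, 8, 3)
x = torch.randn(1, 3, 9, 9)
y = conv(x)
y_rot = conv(torch.rot90(x, 1, dims=[2, 3]))
print(torch.allclose(torch.rot90(y, 1, dims=[3, 4]).roll(1, dims=2), y_rot, atol=1e-5))
```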
FlowNAS: Neural Architecture Search for Optical Flow Estimation
IF 19.5, CAS Q2, Computer Science
International Journal of Computer Vision Pub Date : 2023-10-25 DOI: 10.1007/s11263-023-01920-9
Zhiwei Lin, Tingting Liang, Taihong Xiao, Yongtao Wang, Ming-Hsuan Yang
{"title":"FlowNAS: Neural Architecture Search for Optical Flow Estimation","authors":"Zhiwei Lin, Tingting Liang, Taihong Xiao, Yongtao Wang, Ming-Hsuan Yang","doi":"10.1007/s11263-023-01920-9","DOIUrl":"https://doi.org/10.1007/s11263-023-01920-9","url":null,"abstract":"<p>Recent optical flow estimators usually employ deep models designed for image classification as the encoders for feature extraction and matching. However, those encoders developed for image classification may be sub-optimal for flow estimation. In contrast, the decoder design of optical flow estimators often requires meticulous design for flow estimation. The disconnect between the encoder and decoder could negatively affect optical flow estimation. To address this issue, we propose a neural architecture search method, FlowNAS, to automatically find the more suitable and stronger encoder architecture for existing flow decoders. We first design a suitable search space, including various convolutional operators, and construct a weight-sharing super-network for efficiently evaluating the candidate architectures. To better train the super-network, we present a Feature Alignment Distillation module that utilizes a well-trained flow estimator to guide the training of the super-network. Finally, a resource-constrained evolutionary algorithm is exploited to determine an optimal architecture (i.e., sub-network). Experimental results show that FlowNAS can be easily incorporated into existing flow estimators and achieves state-of-the-art performance with the trade-off between accuracy and efficiency. Furthermore, the encoder architecture discovered by FlowNAS with the weights inherited from the super-network achieves 4.67% F1-all error on KITTI, an 8.4% reduction of RAFT baseline, surpassing state-of-the-art handcrafted GMA and AGFlow models, while reducing the model complexity and latency. The source code and trained models will be released at https://github.com/VDIGPKU/FlowNAS.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"31 50","pages":""},"PeriodicalIF":19.5,"publicationDate":"2023-10-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"71492522","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
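Feature alignment distillation can be illustrated as an L2 penalty between super-network features and those of a frozen, well-trained flow estimator, with a learned projection to reconcile channel widths. The 1x1 projection and plain MSE below are assumptions for illustration, not necessarily the module used in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureAlignmentDistillation(nn.Module):
    """Align a super-network's intermediate features with those of a frozen,
    well-trained flow estimator via an L2 loss, after a 1x1 projection that
    reconciles channel widths."""
    def __init__(self, student_ch, teacher_ch):
        super().__init__()
        self.proj = nn.Conv2d(student_ch, teacher_ch, kernel_size=1)

    def forward(self, student_feat, teacher_feat):
        # student_feat: (B, Cs, H, W); teacher_feat: (B, Ct, H, W), treated as fixed targets.
        aligned = self.proj(student_feat)
        if aligned.shape[-2:] != teacher_feat.shape[-2:]:
            aligned = F.interpolate(aligned, size=teacher_feat.shape[-2:],
                                    mode='bilinear', align_corners=False)
        return F.mse_loss(aligned, teacher_feat.detach())

# Toy usage with hypothetical channel widths.
fad = FeatureAlignmentDistillation(student_ch=96, teacher_ch=128)
loss = fad(torch.randn(2, 96, 32, 32), torch.randn(2, 128, 32, 32))
print(loss.item())
```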
DIVOTrack: A Novel Dataset and Baseline Method for Cross-View Multi-Object Tracking in DIVerse Open Scenes
IF 19.5, CAS Q2, Computer Science
International Journal of Computer Vision Pub Date : 2023-10-25 DOI: 10.1007/s11263-023-01922-7
Shengyu Hao, Peiyuan Liu, Yibing Zhan, Kaixun Jin, Zuozhu Liu, Mingli Song, Jenq-Neng Hwang, Gaoang Wang
{"title":"DIVOTrack: A Novel Dataset and Baseline Method for Cross-View Multi-Object Tracking in DIVerse Open Scenes","authors":"Shengyu Hao, Peiyuan Liu, Yibing Zhan, Kaixun Jin, Zuozhu Liu, Mingli Song, Jenq-Neng Hwang, Gaoang Wang","doi":"10.1007/s11263-023-01922-7","DOIUrl":"https://doi.org/10.1007/s11263-023-01922-7","url":null,"abstract":"<p>Cross-view multi-object tracking aims to link objects between frames and camera views with substantial overlaps. Although cross-view multi-object tracking has received increased attention in recent years, existing datasets still have several issues, including (1) missing real-world scenarios, (2) lacking diverse scenes, (3) containing a limited number of tracks, (4) comprising only static cameras, and (5) lacking standard benchmarks, which hinder the investigation and comparison of cross-view tracking methods. To solve the aforementioned issues, we introduce <i>DIVOTrack</i>: a new cross-view multi-object tracking dataset for DIVerse Open scenes with dense tracking pedestrians in realistic and non-experimental environments. Our DIVOTrack has fifteen distinct scenarios and 953 cross-view tracks, surpassing all cross-view multi-object tracking datasets currently available. Furthermore, we provide a novel baseline cross-view tracking method with a unified joint detection and cross-view tracking framework named <i>CrossMOT</i>, which learns object detection, single-view association, and cross-view matching with an <i>all-in-one</i> embedding model. Finally, we present a summary of current methodologies and a set of standard benchmarks with our DIVOTrack to provide a fair comparison and conduct a comprehensive analysis of current approaches and our proposed CrossMOT. The dataset and code are available at https://github.com/shengyuhao/DIVOTrack.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"31 12","pages":""},"PeriodicalIF":19.5,"publicationDate":"2023-10-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"71492602","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
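Cross-view matching with an all-in-one embedding reduces, at its simplest, to comparing appearance embeddings across views and solving an assignment problem. The sketch below shows that step with cosine similarity and the Hungarian algorithm; the threshold and shapes are illustrative assumptions, and CrossMOT's full pipeline also covers detection and single-view association.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def cross_view_match(emb_a, emb_b, sim_threshold=0.5):
    """Match detections between two camera views using appearance embeddings.

    emb_a: (Na, D) and emb_b: (Nb, D) embeddings from a shared ("all-in-one")
    embedding model. Returns index pairs whose cosine similarity clears the
    threshold after a globally optimal one-to-one assignment.
    """
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    sim = a @ b.T                                   # (Na, Nb) cosine similarities
    row, col = linear_sum_assignment(-sim)          # Hungarian: maximize total similarity
    return [(i, j) for i, j in zip(row, col) if sim[i, j] >= sim_threshold]

# Toy usage: 3 detections in view A, 2 in view B, 16-d embeddings.
rng = np.random.default_rng(0)
print(cross_view_match(rng.normal(size=(3, 16)), rng.normal(size=(2, 16)), 0.0))
```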
Language-Aware Soft Prompting: Text-to-Text Optimization for Few- and Zero-Shot Adaptation of V&L Models
IF 19.5, CAS Q2, Computer Science
International Journal of Computer Vision Pub Date : 2023-10-25 DOI: 10.1007/s11263-023-01904-9
Adrian Bulat, Georgios Tzimiropoulos
{"title":"Language-Aware Soft Prompting: Text-to-Text Optimization for Few- and Zero-Shot Adaptation of V &L Models","authors":"Adrian Bulat, Georgios Tzimiropoulos","doi":"10.1007/s11263-023-01904-9","DOIUrl":"https://doi.org/10.1007/s11263-023-01904-9","url":null,"abstract":"<p>Soft prompt learning has emerged as a promising direction for adapting V &amp;L models to a downstream task using a few training examples. However, current methods significantly overfit the training data suffering from large accuracy degradation when tested on unseen classes from the same domain. In addition, all prior methods operate exclusively under the assumption that both vision and language data is present. To this end, we make the following 5 contributions: (1) To alleviate base class overfitting, we propose a novel Language-Aware Soft Prompting (LASP) learning method by means of a text-to-text cross-entropy loss that maximizes the probability of the learned prompts to be correctly classified with respect to pre-defined hand-crafted textual prompts. (2) To increase the representation capacity of the prompts, we also propose <i>grouped</i> LASP where each group of prompts is optimized with respect to a separate subset of textual prompts. (3) Moreover, we identify a visual-language misalignment introduced by prompt learning and LASP, and more importantly, propose a re-calibration mechanism to address it. (4) Importantly, we show that LASP is inherently amenable to including, during training, <i>virtual classes</i>, i.e. class names for which no visual samples are available, further increasing the robustness of the learned prompts. Expanding for the first time the setting to language-only adaptation, (5) we present a novel zero-shot variant of LASP where no visual samples at all are available for the downstream task. Through evaluations on 11 datasets, we show that our approach (a) significantly outperforms all prior works on soft prompting, and (b) matches and surpasses, for the first time, the accuracy on novel classes obtained by hand-crafted prompts and CLIP for 8 out of 11 test datasets. Finally, (c) we show that our zero-shot variant improves upon CLIP without requiring any extra data. Code will be made available.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"31 47","pages":""},"PeriodicalIF":19.5,"publicationDate":"2023-10-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"71492663","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
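The text-to-text loss can be pictured as classifying each learned-prompt text embedding against the hand-crafted prompt embeddings of all classes. The sketch below assumes one embedding per class and a CLIP-style temperature; in practice the features would come from the frozen text encoder.

```python
import torch
import torch.nn.functional as F

def text_to_text_loss(learned_txt, handcrafted_txt, temperature=0.07):
    """Language-aware regularizer in the spirit of LASP's text-to-text loss.

    learned_txt:     (C, D) text features of the learned soft prompts, one per class.
    handcrafted_txt: (C, D) text features of fixed hand-crafted templates, one per class.
    Each learned-prompt feature is scored against every hand-crafted class feature with
    cosine similarity, and cross-entropy pushes it to be classified as its own class.
    """
    learned = F.normalize(learned_txt, dim=-1)
    handcrafted = F.normalize(handcrafted_txt, dim=-1)
    logits = learned @ handcrafted.t() / temperature          # (C, C) similarity logits
    targets = torch.arange(learned_txt.size(0))               # class c should match class c
    return F.cross_entropy(logits, targets)

# Toy usage with stand-in features for a 10-class problem (real features would
# come from the CLIP text encoder applied to the prompts).
loss = text_to_text_loss(torch.randn(10, 512), torch.randn(10, 512))
print(loss.item())
```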
Towards a Unified Network for Robust Monocular Depth Estimation: Network Architecture, Training Strategy and Dataset
IF 19.5, CAS Q2, Computer Science
International Journal of Computer Vision Pub Date : 2023-10-23 DOI: 10.1007/s11263-023-01915-6
Mochu Xiang, Yuchao Dai, Feiyu Zhang, Jiawei Shi, Xinyu Tian, Zhensong Zhang
{"title":"Towards a Unified Network for Robust Monocular Depth Estimation: Network Architecture, Training Strategy and Dataset","authors":"Mochu Xiang, Yuchao Dai, Feiyu Zhang, Jiawei Shi, Xinyu Tian, Zhensong Zhang","doi":"10.1007/s11263-023-01915-6","DOIUrl":"https://doi.org/10.1007/s11263-023-01915-6","url":null,"abstract":"<p>Robust monocular depth estimation (MDE) aims at learning a <i>unified</i> model that works across diverse real-world scenes, which is an important and active topic in computer vision. In this paper, we present <span>Megatron_RVC</span>, our winning solution for the monocular depth challenge in the Robust Vision Challenge (RVC) 2022, where we tackle the challenging problem from three perspectives: network architecture, training strategy and dataset. In particular, we made three contributions towards robust MDE: (1) we built a neural network with high capacity to enable flexible and accurate monocular depth predictions, which contains dedicated components to provide content-aware embeddings and to improve the richness of the details; (2) we proposed a novel mixing training strategy to handle real-world images with different aspect ratios, resolutions and apply tailored loss functions based on the properties of their depth maps; (3) to train a unified network model that covers diverse real-world scenes, we used over 1 million images from different datasets. As of 3rd October 2022, our unified model ranked consistently first across three benchmarks (KITTI, MPI Sintel, and VIPER) among all participants.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"31 48","pages":""},"PeriodicalIF":19.5,"publicationDate":"2023-10-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"71492520","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
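The abstract does not spell out the tailored losses, but a common choice when mixing depth datasets with inconsistent scales is a scale-and-shift-invariant error, sketched below as an assumption rather than the authors' exact formulation.

```python
import torch

def scale_shift_invariant_l1(pred, gt, mask):
    """Align the prediction to ground truth with a per-image least-squares
    scale and shift, then take a masked L1 error. Useful when mixing datasets
    whose depth maps differ in scale or are only defined up to an affine map.

    pred, gt, mask: (B, H, W); mask selects pixels with valid ground truth.
    """
    losses = []
    for b in range(pred.shape[0]):
        d = pred[b][mask[b] > 0].reshape(-1)
        g = gt[b][mask[b] > 0].reshape(-1)
        # Least-squares scale s and shift t minimizing ||s*d + t - g||^2.
        A = torch.stack([d, torch.ones_like(d)], dim=1)          # (N, 2)
        sol = torch.linalg.lstsq(A, g.unsqueeze(1)).solution     # (2, 1)
        s, t = sol[0, 0], sol[1, 0]
        losses.append((s * d + t - g).abs().mean())
    return torch.stack(losses).mean()

# Toy usage: ground truth is an affine map of the prediction, so the loss is ~0.
pred = torch.rand(2, 8, 8)
gt = 3.0 * pred + 0.5
mask = torch.ones(2, 8, 8)
print(scale_shift_invariant_l1(pred, gt, mask).item())
```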
Indoor Obstacle Discovery on Reflective Ground via Monocular Camera
IF 19.5, CAS Q2, Computer Science
International Journal of Computer Vision Pub Date : 2023-10-20 DOI: 10.1007/s11263-023-01925-4
Feng Xue, Yicong Chang, Tianxi Wang, Yu Zhou, Anlong Ming
{"title":"Indoor Obstacle Discovery on Reflective Ground via Monocular Camera","authors":"Feng Xue, Yicong Chang, Tianxi Wang, Yu Zhou, Anlong Ming","doi":"10.1007/s11263-023-01925-4","DOIUrl":"https://doi.org/10.1007/s11263-023-01925-4","url":null,"abstract":"<p>Visual obstacle discovery is a key step towards autonomous navigation of indoor mobile robots. Successful solutions have many applications in multiple scenes. One of the exceptions is the reflective ground. In this case, the reflections on the floor resemble the true world, which confuses the obstacle discovery and leaves navigation unsuccessful. We argue that the key to this problem lies in obtaining discriminative features for reflections and obstacles. Note that obstacle and reflection can be separated by the ground plane in 3D space. With this observation, we firstly introduce a pre-calibration based ground detection scheme that uses robot motion to predict the ground plane. Due to the immunity of robot motion to reflection, this scheme avoids failed ground detection caused by reflection. Given the detected ground, we design a ground-pixel parallax to describe the location of a pixel relative to the ground. Based on this, a unified appearance-geometry feature representation is proposed to describe objects inside rectangular boxes. Eventually, based on segmenting by detection framework, an appearance-geometry fusion regressor is designed to utilize the proposed feature to discover the obstacles. It also prevents our model from concentrating too much on parts of obstacles instead of whole obstacles. For evaluation, we introduce a new dataset for Obstacle on Reflective Ground (ORG), which comprises 15 scenes with various ground reflections, a total of more than 200 image sequences and 3400 RGB images. The pixel-wise annotations of ground and obstacle provide a comparison to our method and other methods. By reducing the misdetection of the reflection, the proposed approach outperforms others. The source code and the dataset will be available at https://github.com/xuefeng-cvr/IndoorObstacleDiscovery-RG</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"31 49","pages":""},"PeriodicalIF":19.5,"publicationDate":"2023-10-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"71492519","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
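Ground-pixel parallax can be illustrated with the homography induced by the ground plane under known camera motion: pixels on the plane follow the warp, while off-plane pixels deviate from it. The sketch below is a geometric illustration under assumed conventions (y-down camera, plane n^T X = d), not the paper's feature construction.

```python
import numpy as np

def ground_pixel_parallax(pts1, pts2, K, R, t, n, d):
    """Residual parallax of matched pixels w.r.t. the ground-plane homography.

    The ground plane n^T X = d (n a unit normal, d the camera-to-ground
    distance, both in camera-1 coordinates) and the relative motion
    X2 = R X1 + t induce the homography H = K (R + t n^T / d) K^{-1}.
    Pixels on the ground follow this warp; pixels off the plane deviate from
    it, and the deviation can serve as a geometric feature.

    pts1, pts2: (N, 2) matched pixel coordinates in frames 1 and 2.
    """
    H = K @ (R + np.outer(t, n) / d) @ np.linalg.inv(K)
    ones = np.ones((pts1.shape[0], 1))
    p1_h = np.hstack([pts1, ones])                  # homogeneous pixels of frame 1
    warped = (H @ p1_h.T).T
    warped = warped[:, :2] / warped[:, 2:3]         # predicted positions in frame 2
    return np.linalg.norm(pts2 - warped, axis=1)    # parallax magnitude per pixel

# Toy call: identity rotation, small forward motion, camera 1.5 m above a flat floor.
K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
R, t = np.eye(3), np.array([0.0, 0.0, 0.1])
n, d = np.array([0.0, 1.0, 0.0]), 1.5              # y-down camera convention: floor points satisfy Y = 1.5
pts1 = np.array([[320.0, 400.0], [320.0, 100.0]])  # two sample pixel locations
# With pts2 set to pts1, the output is simply how far the ground warp would move each pixel.
print(ground_pixel_parallax(pts1, pts1, K, R, t, n, d))
```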
Style-Hallucinated Dual Consistency Learning: A Unified Framework for Visual Domain Generalization
IF 19.5, CAS Q2, Computer Science
International Journal of Computer Vision Pub Date : 2023-10-18 DOI: 10.1007/s11263-023-01911-w
Yuyang Zhao, Zhun Zhong, Na Zhao, Nicu Sebe, Gim Hee Lee
{"title":"Style-Hallucinated Dual Consistency Learning: A Unified Framework for Visual Domain Generalization","authors":"Yuyang Zhao, Zhun Zhong, Na Zhao, Nicu Sebe, Gim Hee Lee","doi":"10.1007/s11263-023-01911-w","DOIUrl":"https://doi.org/10.1007/s11263-023-01911-w","url":null,"abstract":"<p>Domain shift widely exists in the visual world, while modern deep neural networks commonly suffer from severe performance degradation under domain shift due to poor generalization ability, which limits real-world applications. The domain shift mainly lies in the limited source environmental variations and the large distribution gap between source and unseen target data. To this end, we propose a unified framework, Style-HAllucinated Dual consistEncy learning (SHADE), to handle such domain shift in various visual tasks. Specifically, SHADE is constructed based on two consistency constraints, Style Consistency (SC) and Retrospection Consistency (RC). SC enriches the source situations and encourages the model to learn consistent representation across style-diversified samples. RC leverages general visual knowledge to prevent the model from overfitting to source data and thus largely keeps the representation consistent between the source and general visual models. Furthermore, we present a novel style hallucination module (SHM) to generate style-diversified samples that are essential to consistency learning. SHM selects basis styles from the source distribution, enabling the model to dynamically generate diverse and realistic samples during training. Extensive experiments demonstrate that our versatile SHADE can significantly enhance the generalization in various visual recognition tasks, including image classification, semantic segmentation, and object detection, with different models, i.e., ConvNets and Transformer.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"57 9","pages":""},"PeriodicalIF":19.5,"publicationDate":"2023-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"71512812","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
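Style hallucination can be illustrated in the AdaIN spirit: strip a feature map's channel-wise statistics and re-inject statistics formed as a random convex combination of basis styles. The weighting scheme and shapes below are assumptions for illustration; the paper's SHM selects basis styles from the source distribution in its own way.

```python
import torch

def hallucinate_style(feat, basis_mu, basis_sigma, eps=1e-6):
    """Generate a style-diversified version of a feature map by swapping in
    channel statistics sampled as a random convex combination of basis styles
    (AdaIN-style feature re-normalization).

    feat:        (B, C, H, W) content features.
    basis_mu:    (K, C) channel-wise means of K basis styles.
    basis_sigma: (K, C) channel-wise standard deviations of K basis styles.
    """
    B, C, _, _ = feat.shape
    mu = feat.mean(dim=(2, 3), keepdim=True)
    sigma = feat.std(dim=(2, 3), keepdim=True) + eps
    normalized = (feat - mu) / sigma                       # strip the original style
    # Random convex combination weights over the K basis styles, one set per sample.
    w = torch.rand(B, basis_mu.shape[0])
    w = w / w.sum(dim=1, keepdim=True)
    new_mu = (w @ basis_mu).view(B, C, 1, 1)
    new_sigma = (w @ basis_sigma).view(B, C, 1, 1)
    return normalized * new_sigma + new_mu                 # re-stylized features

# Toy usage: 4 basis styles for 64-channel features.
feat = torch.randn(2, 64, 16, 16)
out = hallucinate_style(feat, torch.rand(4, 64), torch.rand(4, 64) + 0.5)
print(out.shape)  # torch.Size([2, 64, 16, 16])
```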