Computer Vision and Image Understanding: Latest Articles

Adaptive semantic guidance network for video captioning
IF 4.3, CAS Q3, Computer Science
Computer Vision and Image Understanding Pub Date: 2025-02-01 DOI: 10.1016/j.cviu.2024.104255
Yuanyuan Liu, Hong Zhu, Zhong Wu, Sen Du, Shuning Wu, Jing Shi
{"title":"Adaptive semantic guidance network for video captioning","authors":"Yuanyuan Liu ,&nbsp;Hong Zhu ,&nbsp;Zhong Wu ,&nbsp;Sen Du ,&nbsp;Shuning Wu ,&nbsp;Jing Shi","doi":"10.1016/j.cviu.2024.104255","DOIUrl":"10.1016/j.cviu.2024.104255","url":null,"abstract":"<div><div>Video captioning aims to describe video content using natural language, and effectively integrating information of visual and textual is crucial for generating accurate captions. However, we find that the existing methods over-rely on the language-prior information about the text acquired by training, resulting in the model tending to output high-frequency fixed phrases. In order to solve the above problems, we extract high-quality semantic information from multi-modal input and then build a semantic guidance mechanism to adapt to the contribution of visual semantics and text semantics to generate captions. We propose an Adaptive Semantic Guidance Network (ASGNet) for video captioning. The ASGNet consists of a Semantic Enhancement Encoder (SEE) and an Adaptive Control Decoder (ACD). Specifically, the SEE helps the model obtain high-quality semantic representations by exploring the rich semantic information from visual and textual. The ACD dynamically adjusts the contribution weights of semantics about visual and textual for word generation, guiding the model to adaptively focus on the correct semantic information. These two modules work together to help the model overcome the problem of over-reliance on language priors, resulting in more accurate video captions. Finally, we conducted extensive experiments on commonly used video captioning datasets. MSVD and MSR-VTT reached the state-of-the-art, and YouCookII also achieved good performance. These experiments fully verified the advantages of our method.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"251 ","pages":"Article 104255"},"PeriodicalIF":4.3,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143149819","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Adversarial Semi-supervised domain adaptation for semantic segmentation: A new role for labeled target samples
IF 4.3, CAS Q3, Computer Science
Computer Vision and Image Understanding Pub Date: 2025-01-30 DOI: 10.1016/j.cviu.2025.104305
Marwa Kechaou, Mokhtar Z. Alaya, Romain Hérault, Gilles Gasso
{"title":"Adversarial Semi-supervised domain adaptation for semantic segmentation: A new role for labeled target samples","authors":"Marwa Kechaou ,&nbsp;Mokhtar Z. Alaya ,&nbsp;Romain Hérault ,&nbsp;Gilles Gasso","doi":"10.1016/j.cviu.2025.104305","DOIUrl":"10.1016/j.cviu.2025.104305","url":null,"abstract":"<div><div>Adversarial learning baselines for domain adaptation (DA) approaches in the context of semantic segmentation are under explored in semi-supervised framework. These baselines involve solely the available labeled target samples in the supervision loss. In this work, we propose to enhance their usefulness on both semantic segmentation and the single domain classifier neural networks. We design new training objective losses for cases when labeled target data behave as source samples or as real target samples. The underlying rationale is that considering the set of labeled target samples as part of source domain helps reducing the domain discrepancy and, hence, improves the contribution of the adversarial loss. To support our approach, we consider a complementary method that mixes source and labeled target data, then applies the same adaptation process. We further propose an unsupervised selection procedure using entropy to optimize the choice of labeled target samples for adaptation. We illustrate our findings through extensive experiments on the benchmarks GTA5, SYNTHIA, and Cityscapes. The empirical evaluation highlights competitive performance of our proposed approach.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"253 ","pages":"Article 104305"},"PeriodicalIF":4.3,"publicationDate":"2025-01-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143349645","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Mask prior generation with language queries guided networks for referring image segmentation
IF 4.3, CAS Q3, Computer Science
Computer Vision and Image Understanding Pub Date: 2025-01-29 DOI: 10.1016/j.cviu.2025.104296
Jinhao Zhou, Guoqiang Xiao, Michael S. Lew, Song Wu
{"title":"Mask prior generation with language queries guided networks for referring image segmentation","authors":"Jinhao Zhou ,&nbsp;Guoqiang Xiao ,&nbsp;Michael S. Lew ,&nbsp;Song Wu","doi":"10.1016/j.cviu.2025.104296","DOIUrl":"10.1016/j.cviu.2025.104296","url":null,"abstract":"<div><div>The aim of Referring Image Segmentation (RIS) is to generate a pixel-level mask to accurately segment the target object according to its natural language expression. Previous RIS methods ignore exploring the significant language information in both the encoder and decoder stages, and simply use an upsampling-convolution operation to obtain the prediction mask, resulting in inaccurate visual object locating. Thus, this paper proposes a Mask Prior Generation with Language Queries Guided Network (MPG-LQGNet). In the encoder of MPG-LQGNet, a Bidirectional Spatial Alignment Module (BSAM) is designed to realize the bidirectional fusion for both vision and language embeddings, generating additional language queries to understand both the locating of targets and the semantics of the language. Moreover, a Channel Attention Fusion Gate (CAFG) is designed to enhance the exploration of the significance of the cross-modal embeddings. In the decoder of the MPG-LQGNet, the Language Query Guided Mask Prior Generator (LQPG) is designed to utilize the generated language queries to activate significant information in the upsampled decoding features, obtaining the more accurate mask prior that guides the final prediction. Extensive experiments on RefCOCO series datasets show that our method consistently improves over state-of-the-art methods. The source code of our MPG-LQGNet is available at <span><span>https://github.com/SWU-CS-MediaLab/MPG-LQGNet</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"253 ","pages":"Article 104296"},"PeriodicalIF":4.3,"publicationDate":"2025-01-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143136341","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Luminance prior guided Low-Light 4C catenary image enhancement
IF 4.3, CAS Q3, Computer Science
Computer Vision and Image Understanding Pub Date: 2025-01-27 DOI: 10.1016/j.cviu.2025.104287
Zhenhua Xue, Jun Luo, Zhenlin Wei
{"title":"Luminance prior guided Low-Light 4C catenary image enhancement","authors":"Zhenhua Xue ,&nbsp;Jun Luo ,&nbsp;Zhenlin Wei","doi":"10.1016/j.cviu.2025.104287","DOIUrl":"10.1016/j.cviu.2025.104287","url":null,"abstract":"<div><div>In scenarios characterized by inadequate fill lighting, catenary images captured by railway power supply 4C monitoring equipment often exhibit a phenomenon of low light, which can pose significant challenges for accurately detecting anomalies in the equipment. This, in turn, has ramifications for the smooth operation, timely maintenance, and overall safety assurance of railway systems. Recognizing this critical issue, our study introduces an innovative dual-branch priori-guided enhancement method specifically tailored for low-light catenary images obtained through powered 4C monitoring equipment. Within the multi-scale branch of our method, we leverage the powerful capabilities of convolutional neural networks (CNNs) along with the self-attention mechanism to effectively extract both local and global features from the images. This dual focus allows our model to capture intricate details and broader contextual information, enhancing its ability to understand and enhance the images. Concurrently, the pixel-wise branch of our method is designed to estimate enhancement parameters at the pixel level, enabling an adaptive and iterative enhancement process. This fine-grained approach ensures that each pixel in the image is optimized based on its unique characteristics and context, leading to more nuanced and accurate enhancements. To further inform and constrain our enhancement process, we conduct a statistical analysis of the average light intensity of images under both normal and low-light conditions. By examining the differences and correlations between image brightness under these varying light conditions, we derive statistical priors that are integrated into our method. These priors serve as valuable guidance for our model, helping it to make more informed decisions during the enhancement process. Moreover, to mitigate the challenges associated with obtaining labeled data, we adopt an unsupervised model training strategy. This approach allows our method to learn and improve without the need for extensive and costly labeling efforts, making it more practical and scalable for real-world applications. Experimental results demonstrate the superiority of our proposed method when compared to state-of-the-art approaches for low-light catenary image enhancement. Our method improves the visual quality of the images, ultimately contributing to the safety and efficiency of railway operations.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"253 ","pages":"Article 104287"},"PeriodicalIF":4.3,"publicationDate":"2025-01-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143268673","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Large-scale Riemannian meta-optimization via subspace adaptation
IF 4.3, CAS Q3, Computer Science
Computer Vision and Image Understanding Pub Date: 2025-01-27 DOI: 10.1016/j.cviu.2025.104306
Peilin Yu, Yuwei Wu, Zhi Gao, Xiaomeng Fan, Yunde Jia
{"title":"Large-scale Riemannian meta-optimization via subspace adaptation","authors":"Peilin Yu ,&nbsp;Yuwei Wu ,&nbsp;Zhi Gao ,&nbsp;Xiaomeng Fan ,&nbsp;Yunde Jia","doi":"10.1016/j.cviu.2025.104306","DOIUrl":"10.1016/j.cviu.2025.104306","url":null,"abstract":"<div><div>Riemannian meta-optimization provides a promising approach to solving non-linear constrained optimization problems, which trains neural networks as optimizers to perform optimization on Riemannian manifolds. However, existing Riemannian meta-optimization methods take up huge memory footprints in large-scale optimization settings, as the learned optimizer can only adapt gradients of a fixed size and thus cannot be shared across different Riemannian parameters. In this paper, we propose an efficient Riemannian meta-optimization method that significantly reduces the memory burden for large-scale optimization via a subspace adaptation scheme. Our method trains neural networks to individually adapt the row and column subspaces of Riemannian gradients, instead of directly adapting the full gradient matrices in existing Riemannian meta-optimization methods. In this case, our learned optimizer can be shared across Riemannian parameters with different sizes. Our method reduces the model memory consumption by six orders of magnitude when optimizing an orthogonal mainstream deep neural network (<em>e.g.</em> ResNet50). Experiments on multiple Riemannian tasks show that our method can not only reduce the memory consumption but also improve the performance of Riemannian meta-optimization.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"253 ","pages":"Article 104306"},"PeriodicalIF":4.3,"publicationDate":"2025-01-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143268687","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Beyond shadows and light: Odyssey of face recognition for social good
IF 4.3, CAS Q3, Computer Science
Computer Vision and Image Understanding Pub Date: 2025-01-24 DOI: 10.1016/j.cviu.2025.104293
Chiranjeev Chiranjeev, Muskan Dosi, Shivang Agarwal, Jyoti Chaudhary, Pranav Pant, Mayank Vatsa, Richa Singh
{"title":"Beyond shadows and light: Odyssey of face recognition for social good","authors":"Chiranjeev Chiranjeev,&nbsp;Muskan Dosi,&nbsp;Shivang Agarwal,&nbsp;Jyoti Chaudhary ,&nbsp;Pranav Pant,&nbsp;Mayank Vatsa,&nbsp;Richa Singh","doi":"10.1016/j.cviu.2025.104293","DOIUrl":"10.1016/j.cviu.2025.104293","url":null,"abstract":"<div><div>Face recognition technology, though undeniably transformative in its technical evolution, remains conspicuously underleveraged in humanitarian endeavors. This survey highlights its latent utility in addressing critical societal exigencies, ranging from the expeditious identification of disaster-afflicted individuals to locating missing children. We investigate technical complexities arising from facial feature degradation, aging, occlusions, and low-resolution images. These issues are frequently encountered in real-world scenarios. We provide a comprehensive review of state-of-the-art models and relevant datasets, including a meta-analysis of existing and curated collections such as the newly introduced Web and Generated Injured Faces (WGIF) dataset. Our evaluation encompasses the performance of current face recognition algorithms in real-world scenarios, exemplified by a case study on the Balasore train accident in India. By examining factors such as the impact of aging on facial features and the limitations of traditional models in handling low-quality or occluded images, we showcase the complexities inherent in applying face recognition for societal good. We discuss future research directions, emphasizing the need for interdisciplinary collaborations and innovative methodologies to enhance the adaptability and robustness of face recognition systems in humanitarian contexts. Through detailed case studies, we provide insights into the effectiveness of current methods and identify key areas for improvement. Our goal is to encourage the development of specialized face recognition models for social welfare applications, contributing to timely and accurate identification in critical situations.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"253 ","pages":"Article 104293"},"PeriodicalIF":4.3,"publicationDate":"2025-01-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143419469","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
UATST: Towards unpaired arbitrary text-guided style transfer with cross-space modulation
IF 4.3, CAS Q3, Computer Science
Computer Vision and Image Understanding Pub Date: 2024-11-20 DOI: 10.1016/j.cviu.2024.104246
Haibo Chen, Lei Zhao
{"title":"UATST: Towards unpaired arbitrary text-guided style transfer with cross-space modulation","authors":"Haibo Chen ,&nbsp;Lei Zhao","doi":"10.1016/j.cviu.2024.104246","DOIUrl":"10.1016/j.cviu.2024.104246","url":null,"abstract":"<div><div>Existing style transfer methods usually utilize style images to represent the target style. Since style images need to be prepared in advance and are confined to existing artworks, these methods are limited in flexibility and creativity. Compared with images, language is a more natural, common, and flexible way for humans to transmit information. Therefore, a better choice is to utilize text descriptions instead of style images to represent the target style. To this end, we propose a novel <strong>U</strong>npaired <strong>A</strong>rbitrary <strong>T</strong>ext-guided <strong>S</strong>tyle <strong>T</strong>ransfer (<strong>UATST</strong>) framework, which can render arbitrary photographs in the style of arbitrary text descriptions with one single model. To the best of our knowledge, this is the first model that achieves Arbitrary-Text-Per-Model with unpaired training data. In detail, we first use a pre-trained VGG network to map the content image into the VGG feature space, and use a pre-trained CLIP text encoder to map the text description into the CLIP feature space. Then we introduce a cross-space modulation module to bridge these two feature spaces, so that the content and style information in two different spaces can be seamlessly and adaptively combined for stylization. In addition, to learn better style representations, we introduce a new CLIP-based style contrastive loss to our model. Extensive qualitative and quantitative experiments verify the effectiveness and superiority of our method.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"251 ","pages":"Article 104246"},"PeriodicalIF":4.3,"publicationDate":"2024-11-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142745127","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
3D Pose Nowcasting: Forecast the future to improve the present
IF 4.3, CAS Q3, Computer Science
Computer Vision and Image Understanding Pub Date: 2024-11-20 DOI: 10.1016/j.cviu.2024.104233
Alessandro Simoni, Francesco Marchetti, Guido Borghi, Federico Becattini, Lorenzo Seidenari, Roberto Vezzani, Alberto Del Bimbo
{"title":"3D Pose Nowcasting: Forecast the future to improve the present","authors":"Alessandro Simoni ,&nbsp;Francesco Marchetti ,&nbsp;Guido Borghi ,&nbsp;Federico Becattini ,&nbsp;Lorenzo Seidenari ,&nbsp;Roberto Vezzani ,&nbsp;Alberto Del Bimbo","doi":"10.1016/j.cviu.2024.104233","DOIUrl":"10.1016/j.cviu.2024.104233","url":null,"abstract":"<div><div>Technologies to enable safe and effective collaboration and coexistence between humans and robots have gained significant importance in the last few years. A critical component useful for realizing this collaborative paradigm is the understanding of human and robot 3D poses using non-invasive systems. Therefore, in this paper, we propose a novel vision-based system leveraging depth data to accurately establish the 3D locations of skeleton joints. Specifically, we introduce the concept of Pose Nowcasting, denoting the capability of the proposed system to enhance its current pose estimation accuracy by jointly learning to forecast future poses. The experimental evaluation is conducted on two different datasets, providing accurate and real-time performance and confirming the validity of the proposed method on both the robotic and human scenarios.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"251 ","pages":"Article 104233"},"PeriodicalIF":4.3,"publicationDate":"2024-11-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142757045","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Multi-Scale Adaptive Skeleton Transformer for action recognition
IF 4.3, CAS Q3, Computer Science
Computer Vision and Image Understanding Pub Date: 2024-11-19 DOI: 10.1016/j.cviu.2024.104229
Xiaotian Wang, Kai Chen, Zhifu Zhao, Guangming Shi, Xuemei Xie, Xiang Jiang, Yifan Yang
{"title":"Multi-Scale Adaptive Skeleton Transformer for action recognition","authors":"Xiaotian Wang ,&nbsp;Kai Chen ,&nbsp;Zhifu Zhao ,&nbsp;Guangming Shi ,&nbsp;Xuemei Xie ,&nbsp;Xiang Jiang ,&nbsp;Yifan Yang","doi":"10.1016/j.cviu.2024.104229","DOIUrl":"10.1016/j.cviu.2024.104229","url":null,"abstract":"<div><div>Transformer has demonstrated remarkable performance in various computer vision tasks. However, its potential is not fully explored in skeleton-based action recognition. On one hand, existing methods primarily utilize fixed function or pre-learned matrix to encode position information, while overlooking the sample-specific position information. On the other hand, these approaches focus on single-scale spatial relationships, while neglecting the discriminative fine-grained and coarse-grained spatial features. To address these issues, we propose a Multi-Scale Adaptive Skeleton Transformer (MSAST), including Adaptive Skeleton Position Encoding Module (ASPEM), Multi-Scale Embedding Module (MSEM), and Adaptive Relative Location Module (ARLM). ASPEM decouples spatial–temporal information in the position encoding procedure, which acquires inherent dependencies of skeleton sequences. ASPEM is also designed to be dependent on input tokens, which can learn sample-specific position information. The MSEM employs multi-scale pooling to generate multi-scale tokens that contain multi-grained features. Then, the spatial transformer captures multi-scale relations to address the subtle differences between various actions. Another contribution of this paper is that ARLM is presented to mine suitable location information for better recognition performance. Extensive experiments conducted on three benchmark datasets demonstrate that the proposed model achieves Top-1 accuracy of 94.9%/97.5% on NTU-60 C-Sub/C-View, 88.7%/91.6% on NTU-120 X-Sub/X-Set and 97.4% on NW-UCLA, respectively.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"250 ","pages":"Article 104229"},"PeriodicalIF":4.3,"publicationDate":"2024-11-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142700218","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Open-set domain adaptation with visual-language foundation models
IF 4.3, CAS Q3, Computer Science
Computer Vision and Image Understanding Pub Date: 2024-11-19 DOI: 10.1016/j.cviu.2024.104230
Qing Yu, Go Irie, Kiyoharu Aizawa
{"title":"Open-set domain adaptation with visual-language foundation models","authors":"Qing Yu ,&nbsp;Go Irie ,&nbsp;Kiyoharu Aizawa","doi":"10.1016/j.cviu.2024.104230","DOIUrl":"10.1016/j.cviu.2024.104230","url":null,"abstract":"<div><div>Unsupervised domain adaptation (UDA) has proven to be very effective in transferring knowledge obtained from a source domain with labeled data to a target domain with unlabeled data. Owing to the lack of labeled data in the target domain and the possible presence of unknown classes, open-set domain adaptation (ODA) has emerged as a potential solution to identify these classes during the training phase. Although existing ODA approaches aim to solve the distribution shifts between the source and target domains, most methods fine-tuned ImageNet pre-trained models on the source domain with the adaptation on the target domain. Recent visual-language foundation models (VLFM), such as Contrastive Language-Image Pre-Training (CLIP), are robust to many distribution shifts and, therefore, should substantially improve the performance of ODA. In this work, we explore generic ways to adopt CLIP, a popular VLFM, for ODA. We investigate the performance of zero-shot prediction using CLIP, and then propose an entropy optimization strategy to assist the ODA models with the outputs of CLIP. The proposed approach achieves state-of-the-art results on various benchmarks, demonstrating its effectiveness in addressing the ODA problem.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"250 ","pages":"Article 104230"},"PeriodicalIF":4.3,"publicationDate":"2024-11-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142722825","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0