Latest Articles from the International Journal of Computer Vision

Contextual Object Detection with Multimodal Large Language Models
IF 19.5, CAS Tier 2, Computer Science
International Journal of Computer Vision Pub Date: 2024-08-20 DOI: 10.1007/s11263-024-02214-4
Yuhang Zang, Wei Li, Jun Han, Kaiyang Zhou, Chen Change Loy
{"title":"Contextual Object Detection with Multimodal Large Language Models","authors":"Yuhang Zang, Wei Li, Jun Han, Kaiyang Zhou, Chen Change Loy","doi":"10.1007/s11263-024-02214-4","DOIUrl":"https://doi.org/10.1007/s11263-024-02214-4","url":null,"abstract":"<p>Recent Multimodal Large Language Models (MLLMs) are remarkable in vision-language tasks, such as image captioning and question answering, but lack the essential perception ability, <i>i.e.</i>, object detection. In this work, we address this limitation by introducing a novel research problem of <i>contextual object detection</i>—understanding visible objects within different human-AI interactive contexts. Three representative scenarios are investigated, including the language cloze test, visual captioning, and question answering. Moreover, we present ContextDET, a unified multimodal model that is capable of end-to-end differentiable modeling of visual-language contexts, so as to locate, identify, and associate visual objects with language inputs for human-AI interaction. Our ContextDET involves three key submodels: (i) a visual encoder for extracting visual representations, (ii) a pre-trained LLM for multimodal context decoding, and (iii) a visual decoder for predicting bounding boxes given contextual object words. The new <i>generate-then-detect</i> framework enables us to detect object words within human vocabulary. Extensive experiments show the advantages of ContextDET on our proposed CODE benchmark, open-vocabulary detection, and referring image segmentation.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"128 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2024-08-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142013743","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
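The generate-then-detect design described in the abstract can be pictured as a three-stage forward pass: encode the image into visual tokens, let a language model decode object words from the multimodal context, then predict a box for each contextual object word. The PyTorch sketch below only illustrates that data flow with assumed toy modules (ContextualDetectorSketch and its heads are placeholders, not the authors' released ContextDET code):

```python
import torch
import torch.nn as nn

class ContextualDetectorSketch(nn.Module):
    """Toy outline of a generate-then-detect pipeline (hypothetical, not the released ContextDET)."""
    def __init__(self, vocab_size=1000, d_model=256):
        super().__init__()
        self.visual_encoder = nn.Conv2d(3, d_model, kernel_size=16, stride=16)   # image -> patch features
        self.context_decoder = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)  # stand-in for the LLM
        self.word_head = nn.Linear(d_model, vocab_size)      # generates object words for the context
        self.box_head = nn.Linear(d_model, 4)                # predicts a box per contextual object word

    def forward(self, image, text_embeds):
        # (i) visual encoder: image -> patch tokens
        vis = self.visual_encoder(image).flatten(2).transpose(1, 2)        # (B, N, d)
        # (ii) multimodal context decoding: language tokens attend to visual tokens
        ctx = self.context_decoder(text_embeds, vis)                       # (B, T, d)
        word_logits = self.word_head(ctx)                                  # which object word fills each slot
        # (iii) visual decoder: boxes conditioned on the contextual word states
        boxes = self.box_head(ctx).sigmoid()                               # normalized (cx, cy, w, h)
        return word_logits, boxes

model = ContextualDetectorSketch()
img = torch.randn(2, 3, 224, 224)
txt = torch.randn(2, 12, 256)   # embedded language context, e.g. a cloze sentence
logits, boxes = model(img, txt)
```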
Source-Free Domain Adaptation Guided by Vision and Vision-Language Pre-training
IF 19.5, CAS Tier 2, Computer Science
International Journal of Computer Vision Pub Date: 2024-08-20 DOI: 10.1007/s11263-024-02215-3
Wenyu Zhang, Li Shen, Chuan-Sheng Foo
{"title":"Source-Free Domain Adaptation Guided by Vision and Vision-Language Pre-training","authors":"Wenyu Zhang, Li Shen, Chuan-Sheng Foo","doi":"10.1007/s11263-024-02215-3","DOIUrl":"https://doi.org/10.1007/s11263-024-02215-3","url":null,"abstract":"<p>Source-free domain adaptation (SFDA) aims to adapt a source model trained on a fully-labeled source domain to a related but unlabeled target domain. While the source model is a key avenue for acquiring target pseudolabels, the generated pseudolabels may exhibit source bias. In the conventional SFDA pipeline, a large data (e.g. ImageNet) pre-trained feature extractor is used to initialize the source model at the start of source training, and subsequently discarded. Despite having diverse features important for generalization, the pre-trained feature extractor can overfit to the source data distribution during source training and forget relevant target domain knowledge. Rather than discarding this valuable knowledge, we introduce an integrated framework to incorporate pre-trained networks into the target adaptation process. The proposed framework is flexible and allows us to plug modern pre-trained networks into the adaptation process to leverage their stronger representation learning capabilities. For adaptation, we propose the <i>Co-learn</i> algorithm to improve target pseudolabel quality collaboratively through the source model and a pre-trained feature extractor. Building on the recent success of the vision-language model CLIP in zero-shot image recognition, we present an extension <i>Co-learn</i><span>++</span> to further incorporate CLIP’s zero-shot classification decisions. We evaluate on 4 benchmark datasets and include more challenging scenarios such as open-set, partial-set and open-partial SFDA. Experimental results demonstrate that our proposed strategy improves adaptation performance and can be successfully integrated with existing SFDA methods.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"41 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2024-08-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142007517","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
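One way to picture the collaborative pseudolabeling idea is to keep a target pseudolabel only when the adapted source model and a frozen pre-trained feature space agree on it. The snippet below is a hedged sketch of such an agreement filter; the function name and the nearest-centroid rule are assumptions for illustration, not the paper's exact Co-learn/Co-learn++ procedure:

```python
import torch
import torch.nn.functional as F

def colearn_style_pseudolabels(src_logits, pretrain_feats, num_classes):
    """Keep a target pseudolabel only when the source model and a pre-trained
    feature space agree on it (illustrative agreement filter)."""
    src_probs = src_logits.softmax(dim=1)
    src_pl = src_probs.argmax(dim=1)                               # source-model pseudolabels

    # class centroids in the pre-trained feature space, built from the source-model pseudolabels
    feats = F.normalize(pretrain_feats, dim=1)
    centroids = torch.stack([feats[src_pl == c].mean(dim=0) if (src_pl == c).any()
                             else torch.zeros(feats.size(1)) for c in range(num_classes)])
    centroids = F.normalize(centroids, dim=1)

    feat_pl = (feats @ centroids.t()).argmax(dim=1)                # nearest-centroid pseudolabels
    keep = src_pl == feat_pl                                       # agreement mask
    return src_pl, keep

logits = torch.randn(32, 10)             # source-model logits on a target batch
feats = torch.randn(32, 512)             # features from a frozen pre-trained (or CLIP) backbone
pl, mask = colearn_style_pseudolabels(logits, feats, num_classes=10)
```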
LLMFormer: Large Language Model for Open-Vocabulary Semantic Segmentation
IF 19.5, CAS Tier 2, Computer Science
International Journal of Computer Vision Pub Date: 2024-08-16 DOI: 10.1007/s11263-024-02171-y
Hengcan Shi, Son Duy Dao, Jianfei Cai
{"title":"LLMFormer: Large Language Model for Open-Vocabulary Semantic Segmentation","authors":"Hengcan Shi, Son Duy Dao, Jianfei Cai","doi":"10.1007/s11263-024-02171-y","DOIUrl":"https://doi.org/10.1007/s11263-024-02171-y","url":null,"abstract":"<p>Open-vocabulary (OV) semantic segmentation has attracted increasing attention in recent years, which aims to recognize objects in an open class set for real-world applications. While prior OV semantic segmentation approaches have relied on additional semantic knowledge derived from vision-language (VL) pre-training, such as the popular CLIP model, this paper introduces a novel paradigm by harnessing the unprecedented capabilities of large language models (LLMs). Inspired by recent breakthroughs in LLMs that provide a richer knowledge base compared to traditional vision-language pre-training, our proposed methodology capitalizes on the vast knowledge embedded within LLMs for OV semantic segmentation. Particularly, we partition LLM knowledge into object, attribute, and relation priors, and propose three novel attention modules-semantic, scaled visual, and relation attentions, to utilize the LLM priors. Extensive experiments are conducted on common benchmarks including ADE20K (847 classes) and Pascal Context (459 classes). The results show that our model outperforms previous state-of-the-art (SoTA) methods by up to 7.2% absolute. Moreover, unlike previous VL-pre-training-based works, our method can even predict OV segmentation results without target candidate classes.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"32 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2024-08-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141994508","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
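As a rough picture of how LLM-derived class priors could be injected into a segmentation head, the sketch below cross-attends pixel tokens with one prior embedding per class and scores pixels by similarity to those priors. It is an assumed simplification: SemanticAttentionSketch is a placeholder name and does not reproduce the paper's semantic, scaled visual, and relation attention modules.

```python
import torch
import torch.nn as nn

class SemanticAttentionSketch(nn.Module):
    """Minimal sketch of injecting LLM-derived class priors into segmentation
    (assumed design, not the paper's exact attention modules)."""
    def __init__(self, d=256):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, num_heads=8, batch_first=True)

    def forward(self, pixel_feats, prior_embeds):
        # pixel_feats: (B, HW, d) visual tokens; prior_embeds: (K, d) one prior per class,
        # e.g. an embedded LLM description of the class's typical attributes.
        priors = prior_embeds.unsqueeze(0).expand(pixel_feats.size(0), -1, -1)
        enriched, _ = self.attn(query=pixel_feats, key=priors, value=priors)  # pixels read class priors
        logits = enriched @ prior_embeds.t()        # per-pixel class scores via similarity to priors
        return logits                               # (B, HW, K)

m = SemanticAttentionSketch()
out = m(torch.randn(2, 32 * 32, 256), torch.randn(459, 256))   # e.g. Pascal Context's 459 classes
```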
Masked Channel Modeling for Bootstrapping Visual Pre-training
IF 19.5, CAS Tier 2, Computer Science
International Journal of Computer Vision Pub Date: 2024-08-16 DOI: 10.1007/s11263-024-02204-6
Yang Liu, Xinlong Wang, Muzhi Zhu, Yue Cao, Tiejun Huang, Chunhua Shen
{"title":"Masked Channel Modeling for Bootstrapping Visual Pre-training","authors":"Yang Liu, Xinlong Wang, Muzhi Zhu, Yue Cao, Tiejun Huang, Chunhua Shen","doi":"10.1007/s11263-024-02204-6","DOIUrl":"https://doi.org/10.1007/s11263-024-02204-6","url":null,"abstract":"<p>Large vision models have achieved great success in computer vision recently, e.g., CLIP for large-scale image-text contrastive learning. They have prominent potential in representation learning and show strong transfer ability in various downstream tasks. However, directly training a larger CLIP model from scratch is difficult because of the enormous training cost, unstable training, and difficulty in collecting a large amount of training data. In this work, we aim to scale the sizes of CLIP models and extend their strong capabilities with self-supervised representation learning. We introduce masked channel modeling (MCM), a new self-supervised learning framework that randomly masks the input feature maps extracted by a CLIP model and reconstructs the missing features. Unlike masked image modeling (MIM) which takes raw pixels as the input and output, MCM performs masked modeling at a high-dimensional semantic space by masking random channels of the visual features and reconstructing the corrupted channels. We show that channel maps are a great fit for masked modeling, as the visual features are semantically structured across channels. We demonstrate that our method can easily scale up the CLIP model at a low training cost, and extend its capabilities on zero-shot learning, few-shot learning, and end-to-end fine-tuning. Based on CLIP ViT-L, MCM improves the zero-shot image classification accuracy by 0.5% averaged over 8 benchmarks. With a few samples, e.g., 1-shot or 2-shot, MCM achieves significant improvements when adapting to 11 image classification benchmarks. In addition, MCM shows strong performance when end-to-end fine-tuned on different downstream tasks, e.g., improving CLIP ViT-B by 0.9% top-1 accuracy on ImageNet-1K classification and 2.5% mIoU on ADE20K semantic segmentation.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"6 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2024-08-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141994391","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
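The masked-channel objective itself is compact: zero out a random subset of feature channels produced by a frozen CLIP encoder and train a network to reconstruct the corrupted channels. The sketch below illustrates that loss under assumed shapes and a stand-in student network; it is not the paper's training recipe or architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def masked_channel_loss(student, clip_feats, mask_ratio=0.5):
    """Illustrative masked-channel-modeling loss. clip_feats: (B, C, H, W)
    features assumed to come from a frozen CLIP image encoder."""
    B, C, H, W = clip_feats.shape
    num_masked = int(C * mask_ratio)
    # pick a random subset of channels per sample to corrupt
    idx = torch.rand(B, C).argsort(dim=1)[:, :num_masked]            # (B, num_masked)
    mask = torch.zeros(B, C).scatter_(1, idx, 1.0).bool()            # True = masked channel
    corrupted = clip_feats.masked_fill(mask[:, :, None, None], 0.0)
    pred = student(corrupted)                                        # reconstruct all channels
    # supervise only the channels that were masked out
    return F.mse_loss(pred[mask], clip_feats[mask])

student = nn.Conv2d(512, 512, kernel_size=3, padding=1)   # stand-in for the learnable model
feats = torch.randn(4, 512, 14, 14)                       # pretend these came from CLIP
loss = masked_channel_loss(student, feats)
loss.backward()
```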
Out-of-Distribution Detection with Virtual Outlier Smoothing
IF 19.5, CAS Tier 2, Computer Science
International Journal of Computer Vision Pub Date: 2024-08-14 DOI: 10.1007/s11263-024-02210-8
Jun Nie, Yadan Luo, Shanshan Ye, Yonggang Zhang, Xinmei Tian, Zhen Fang
{"title":"Out-of-Distribution Detection with Virtual Outlier Smoothing","authors":"Jun Nie, Yadan Luo, Shanshan Ye, Yonggang Zhang, Xinmei Tian, Zhen Fang","doi":"10.1007/s11263-024-02210-8","DOIUrl":"https://doi.org/10.1007/s11263-024-02210-8","url":null,"abstract":"<p>Detecting out-of-distribution (OOD) inputs plays a crucial role in guaranteeing the reliability of deep neural networks (DNNs) when deployed in real-world scenarios. However, DNNs typically exhibit overconfidence in OOD samples, which is attributed to the similarity in patterns between OOD and in-distribution (ID) samples. To mitigate this overconfidence, advanced approaches suggest the incorporation of auxiliary OOD samples during model training, where the outliers are assigned with an equal likelihood of belonging to any category. However, identifying outliers that share patterns with ID samples poses a significant challenge. To address the challenge, we propose a novel method, <u>V</u>irtual <u>O</u>utlier <u>S</u>m<u>o</u>othing (VOSo), which constructs auxiliary outliers using ID samples, thereby eliminating the need to search for OOD samples. Specifically, VOSo creates these virtual outliers by perturbing the semantic regions of ID samples and infusing patterns from other ID samples. For instance, a virtual outlier might consist of a cat’s face with a dog’s nose, where the cat’s face serves as the semantic feature for model prediction. Meanwhile, VOSo adjusts the labels of virtual OOD samples based on the extent of semantic region perturbation, aligning with the notion that virtual outliers may contain ID patterns. Extensive experiments are conducted on diverse OOD detection benchmarks, demonstrating the effectiveness of the proposed VOSo. Our code will be available at https://github.com/junz-debug/VOSo.\u0000</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"5 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2024-08-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141980951","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
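A minimal way to picture a virtual outlier is a CutMix-style composite: paste a region from one in-distribution image into another and soften the label toward uniform in proportion to how much was perturbed. The sketch below follows that reading; the region choice here is a random box, whereas VOSo perturbs semantic regions, so treat it as an assumption-laden illustration rather than the paper's construction:

```python
import torch

def make_virtual_outlier(x_a, x_b, y_a, num_classes, box_frac=0.5):
    """Build a virtual outlier from two ID images (simplified CutMix-style stand-in).
    x_a, x_b: (C, H, W) images; y_a: int class label of x_a."""
    C, H, W = x_a.shape
    h, w = int(H * box_frac), int(W * box_frac)
    top = torch.randint(0, H - h + 1, (1,)).item()
    left = torch.randint(0, W - w + 1, (1,)).item()

    x_virtual = x_a.clone()
    x_virtual[:, top:top + h, left:left + w] = x_b[:, top:top + h, left:left + w]  # infuse foreign pattern

    # soften the label toward uniform in proportion to how much of the image was perturbed
    perturbed = (h * w) / (H * W)
    y_soft = torch.full((num_classes,), perturbed / num_classes)
    y_soft[y_a] += 1.0 - perturbed
    return x_virtual, y_soft

xa, xb = torch.rand(3, 32, 32), torch.rand(3, 32, 32)
x_out, y_out = make_virtual_outlier(xa, xb, y_a=3, num_classes=10)
```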
Test-time Forgery Detection with Spatial-Frequency Prompt Learning
IF 19.5, CAS Tier 2, Computer Science
International Journal of Computer Vision Pub Date: 2024-08-13 DOI: 10.1007/s11263-024-02208-2
Junxian Duan, Yuang Ai, Jipeng Liu, Shenyuan Huang, Huaibo Huang, Jie Cao, Ran He
{"title":"Test-time Forgery Detection with Spatial-Frequency Prompt Learning","authors":"Junxian Duan, Yuang Ai, Jipeng Liu, Shenyuan Huang, Huaibo Huang, Jie Cao, Ran He","doi":"10.1007/s11263-024-02208-2","DOIUrl":"https://doi.org/10.1007/s11263-024-02208-2","url":null,"abstract":"<p>The significance of face forgery detection has grown substantially due to the emergence of facial manipulation technologies. Recent methods have turned to face detection forgery in the spatial-frequency domain, resulting in improved overall performance. Nonetheless, these methods are still not guaranteed to cover various forgery technologies, and the networks trained on public datasets struggle to accurately quantify their uncertainty levels. In this work, we design a Dynamic Dual-spectrum Interaction Network that allows test-time training with uncertainty guidance and spatial-frequency prompt learning. RGB and frequency features are first interacted in multi-level by using a Frequency-guided Attention Module. Then these multi-modal features are merged with a Dynamic Fusion Module. As a bias in the fusion weight of uncertain data during dynamic fusion, we further exploit uncertain perturbation as guidance during the test-time training phase. Furthermore, we propose a spatial-frequency prompt learning method to effectively enhance the generalization of the forgery detection model. Finally, we curate a novel, extensive dataset containing images synthesized by various diffusion and non-diffusion methods. Comprehensive evaluations of experiments show that our method achieves more appealing results for face forgery detection than recent state-of-the-art methods.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"52 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2024-08-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141980948","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
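The spatial-frequency idea can be sketched as two branches, one on the RGB image and one on its Fourier amplitude, fused with weights predicted from the features themselves as a crude stand-in for uncertainty-guided dynamic fusion. The toy module below (DualSpectrumSketch, a hypothetical name) only illustrates that structure, not the paper's Frequency-guided Attention Module, Dynamic Fusion Module, or test-time training loop:

```python
import torch
import torch.nn as nn

class DualSpectrumSketch(nn.Module):
    """Simplified RGB/frequency fusion with learned fusion weights (illustrative only)."""
    def __init__(self, d=64):
        super().__init__()
        self.rgb_branch = nn.Sequential(nn.Conv2d(3, d, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1))
        self.freq_branch = nn.Sequential(nn.Conv2d(3, d, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1))
        self.gate = nn.Linear(2 * d, 2)        # predicts fusion weights (a proxy for per-branch reliability)
        self.cls = nn.Linear(d, 2)             # real vs. fake

    def forward(self, x):
        amp = torch.fft.fft2(x, norm="ortho").abs()          # frequency-domain amplitude of the image
        f_rgb = self.rgb_branch(x).flatten(1)                # (B, d)
        f_frq = self.freq_branch(amp).flatten(1)             # (B, d)
        w = self.gate(torch.cat([f_rgb, f_frq], dim=1)).softmax(dim=1)   # dynamic fusion weights
        fused = w[:, :1] * f_rgb + w[:, 1:] * f_frq
        return self.cls(fused)

logits = DualSpectrumSketch()(torch.rand(2, 3, 64, 64))
```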
Self-supervised Scalable Deep Compressed Sensing
IF 19.5, CAS Tier 2, Computer Science
International Journal of Computer Vision Pub Date: 2024-08-13 DOI: 10.1007/s11263-024-02209-1
Bin Chen, Xuanyu Zhang, Shuai Liu, Yongbing Zhang, Jian Zhang
{"title":"Self-supervised Scalable Deep Compressed Sensing","authors":"Bin Chen, Xuanyu Zhang, Shuai Liu, Yongbing Zhang, Jian Zhang","doi":"10.1007/s11263-024-02209-1","DOIUrl":"https://doi.org/10.1007/s11263-024-02209-1","url":null,"abstract":"<p>Compressed sensing (CS) is a promising tool for reducing sampling costs. Current deep neural network (NN)-based CS approaches face the challenges of collecting labeled measurement-ground truth (GT) data and generalizing to real applications. This paper proposes a novel <b>S</b>elf-supervised s<b>C</b>alable deep CS method, comprising a deep <b>L</b>earning scheme called <b>SCL</b> and a family of <b>Net</b>works named <b>SCNet</b>, which does not require GT and can handle arbitrary sampling ratios and matrices once trained on a partial measurement set. Our SCL contains a dual-domain loss and a four-stage recovery strategy. The former encourages a cross-consistency on two measurement parts and a sampling-reconstruction cycle-consistency regarding arbitrary ratios and matrices to maximize data utilization. The latter can progressively leverage the common signal prior in external measurements and internal characteristics of test samples and learned NNs to improve accuracy. SCNet combines both the explicit guidance from optimization algorithms and the implicit regularization from advanced NN blocks to learn a collaborative signal representation. Our theoretical analyses and experiments on simulated and real captured data, covering 1-/2-/3-D natural and scientific signals, demonstrate the effectiveness, superior performance, flexibility, and generalization ability of our method over existing self-supervised methods and its significant potential in competing against many state-of-the-art supervised methods. Code is available at https://github.com/Guaishou74851/SCNet.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"142 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2024-08-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141980949","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
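The dual-domain loss described above can be read as two self-supervised terms: a cross-consistency term in which a reconstruction from one part of the measurements must explain the held-out part, and a cycle-consistency term in which the reconstruction is re-sampled with a random matrix and ratio and must reconstruct to the same signal. The sketch below is one interpretation of those terms with a toy reconstruction network; scl_style_losses and TinyRecon are assumed names, not the released SCL/SCNet code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def scl_style_losses(net, y, A, split=0.5, n=256):
    """Two illustrative self-supervised terms. y: (B, M) measurements taken with
    sampling matrix A: (M, n)."""
    M = y.size(1)
    k = int(M * split)
    A1, y1 = A[:k], y[:, :k]            # measurement part 1
    A2, y2 = A[k:], y[:, k:]            # measurement part 2

    x1 = net(y1, A1)                    # reconstruct the signal from part 1 alone
    loss_cross = F.mse_loss(x1 @ A2.t(), y2)       # it should explain the held-out measurements

    # cycle-consistency: re-sample the reconstruction with a random matrix/ratio, reconstruct again
    m_new = torch.randint(low=n // 8, high=n // 2, size=(1,)).item()
    A_new = torch.randn(m_new, n) / n ** 0.5
    x2 = net(x1.detach() @ A_new.t(), A_new)
    loss_cycle = F.mse_loss(x2, x1.detach())
    return loss_cross + loss_cycle

class TinyRecon(nn.Module):             # placeholder network: least-squares back-projection + MLP refinement
    def __init__(self, n=256):
        super().__init__()
        self.refine = nn.Sequential(nn.Linear(n, n), nn.ReLU(), nn.Linear(n, n))
    def forward(self, y, A):
        x0 = y @ torch.linalg.pinv(A).t()           # pseudo-inverse initialization
        return x0 + self.refine(x0)

n = 256
A = torch.randn(n // 2, n) / n ** 0.5
x = torch.randn(8, n)
loss = scl_style_losses(TinyRecon(n), x @ A.t(), A, n=n)
loss.backward()
```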
FMGS: Foundation Model Embedded 3D Gaussian Splatting for Holistic 3D Scene Understanding
IF 19.5, CAS Tier 2, Computer Science
International Journal of Computer Vision Pub Date: 2024-08-12 DOI: 10.1007/s11263-024-02183-8
Xingxing Zuo, Pouya Samangouei, Yunwen Zhou, Yan Di, Mingyang Li
{"title":"FMGS: Foundation Model Embedded 3D Gaussian Splatting for Holistic 3D Scene Understanding","authors":"Xingxing Zuo, Pouya Samangouei, Yunwen Zhou, Yan Di, Mingyang Li","doi":"10.1007/s11263-024-02183-8","DOIUrl":"https://doi.org/10.1007/s11263-024-02183-8","url":null,"abstract":"<p>Precisely perceiving the geometric and semantic properties of real-world 3D objects is crucial for the continued evolution of augmented reality and robotic applications. To this end, we present Foundation Model Embedded Gaussian Splatting (FMGS), which incorporates vision-language embeddings of foundation models into 3D Gaussian Splatting (GS). The key contribution of this work is an efficient method to reconstruct and represent 3D vision-language models. This is achieved by distilling feature maps generated from image-based foundation models into those rendered from our 3D model. To ensure high-quality rendering and fast training, we introduce a novel scene representation by integrating strengths from both GS and multi-resolution hash encodings (MHE). Our effective training procedure also introduces a pixel alignment loss that makes the rendered feature distance of same semantic entities close, following the pixel-level semantic boundaries. Our results demonstrate remarkable multi-view semantic consistency, facilitating diverse downstream tasks, beating state-of-the-art methods by <span>({10.2})</span>object detection, despite that we are <span>({851times })</span> faster for inference. This research explores the intersection of vision, language, and 3D scene representation, paving the way for enhanced scene understanding in uncontrolled real-world environments. We plan to release the code on the [project page].</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"368 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2024-08-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141918837","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
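At the loss level, the feature distillation and pixel alignment described above can be sketched as (i) matching rendered per-pixel features to 2D foundation-model features and (ii) pulling together rendered features of pixels that belong to the same semantic entity. The snippet below is a simplified reading under assumed inputs; the grouping labels, loss weights, and function name are placeholders, not the FMGS implementation:

```python
import torch
import torch.nn.functional as F

def distill_and_align_losses(rendered, target, group_ids):
    """rendered: (N, D) features rendered from the 3D model at N sampled pixels;
    target:   (N, D) foundation-model (e.g. CLIP) features at the same pixels;
    group_ids:(N,)   which semantic entity each pixel belongs to (assumed given)."""
    # distillation: rendered features should match the 2D foundation-model features
    loss_distill = 1.0 - F.cosine_similarity(rendered, target, dim=1).mean()

    # pixel alignment: pixels of the same entity should have nearby rendered features
    feats = F.normalize(rendered, dim=1)
    same = group_ids[:, None] == group_ids[None, :]                  # (N, N) same-entity mask
    dist = torch.cdist(feats, feats)                                 # pairwise feature distances
    loss_align = dist[same].mean()
    return loss_distill + 0.1 * loss_align                           # 0.1 is an arbitrary weight

r = torch.randn(512, 64, requires_grad=True)
t = torch.randn(512, 64)
g = torch.randint(0, 20, (512,))
loss = distill_and_align_losses(r, t, g)
loss.backward()
```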
Rethinking Open-World DeepFake Attribution with Multi-perspective Sensory Learning
IF 19.5, CAS Tier 2, Computer Science
International Journal of Computer Vision Pub Date: 2024-08-12 DOI: 10.1007/s11263-024-02184-7
Zhimin Sun, Shen Chen, Taiping Yao, Ran Yi, Shouhong Ding, Lizhuang Ma
{"title":"Rethinking Open-World DeepFake Attribution with Multi-perspective Sensory Learning","authors":"Zhimin Sun, Shen Chen, Taiping Yao, Ran Yi, Shouhong Ding, Lizhuang Ma","doi":"10.1007/s11263-024-02184-7","DOIUrl":"https://doi.org/10.1007/s11263-024-02184-7","url":null,"abstract":"<p>The challenge in sourcing attribution for forgery faces has gained widespread attention due to the rapid development of generative techniques. While many recent works have taken essential steps on GAN-generated faces, more threatening attacks related to identity swapping or diffusion models are still overlooked. And the forgery traces hidden in unknown attacks from the open-world unlabeled faces remain under-explored. To push the related frontier research, we introduce a novel task named Open-World DeepFake Attribution, and the corresponding benchmark OW-DFA++, which aims to evaluate attribution performance against various types of fake faces in open-world scenarios. Meanwhile, we propose a Multi-Perspective Sensory Learning (MPSL) framework that aims to address the challenge of OW-DFA++. Since different forged faces have different tampering regions and frequency artifacts, we introduce the Multi-Perception Voting (MPV) module, which aligns inter-sample features based on global, multi-scale local, and frequency relations. The MPV module effectively filters and groups together samples belonging to the same attack type. Pseudo-labeling is another common and effective strategy in semi-supervised learning tasks, and we propose the Confidence-Adaptive Pseudo-labeling (CAP) module, using soft pseudo-labeling to enhance the class compactness and mitigate pseudo-noise induced by similar novel attack methods. The CAP module imposes strong constraints and adaptively filters samples with high uncertainty to improve the accuracy of the pseudo-labeling. In addition, we extend the MPSL framework with a multi-stage paradigm that leverages pre-train technique and iterative learning to further enhance traceability performance. Extensive experiments and visualizations verify the superiority of our proposed method on the OW-DFA++ and demonstrate the interpretability of the deepfake attribution task and its impact on improving the security of the deepfake detection area.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"191 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2024-08-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141918839","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
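Two of the ingredients above lend themselves to a compact sketch: voting over several feature perspectives to decide which unlabeled samples group together, and keeping soft pseudo-labels only above a batch-adaptive confidence threshold. The code below is a toy rendering of both ideas; the function names, the top-k voting rule, and the quantile threshold are assumptions, not the paper's MPV and CAP modules:

```python
import torch
import torch.nn.functional as F

def multi_perspective_neighbors(global_f, local_f, freq_f, k=5, min_votes=2):
    """Group sample j with sample i only if j is among i's top-k neighbours in at
    least `min_votes` of the global / local / frequency feature spaces (toy voting)."""
    n = global_f.size(0)
    votes = torch.zeros(n, n)
    for f in (global_f, local_f, freq_f):
        f = F.normalize(f, dim=1)
        nbr = (f @ f.t()).topk(k + 1, dim=1).indices[:, 1:]          # top-k neighbours, self excluded
        votes += torch.zeros(n, n).scatter_(1, nbr, 1.0)
    return votes >= min_votes                                        # (n, n) grouping mask

def confidence_adaptive_pseudolabels(logits, quantile=0.7):
    """Keep soft pseudo-labels only for samples above a batch-adaptive confidence threshold."""
    probs = logits.softmax(dim=1)
    conf = probs.max(dim=1).values
    keep = conf >= torch.quantile(conf, quantile)                     # adaptive, per-batch threshold
    return probs, keep

mask = multi_perspective_neighbors(torch.randn(64, 128), torch.randn(64, 128), torch.randn(64, 128))
soft, keep = confidence_adaptive_pseudolabels(torch.randn(64, 7))
```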
Variational Rectification Inference for Learning with Noisy Labels
IF 19.5, CAS Tier 2, Computer Science
International Journal of Computer Vision Pub Date: 2024-08-12 DOI: 10.1007/s11263-024-02205-5
Haoliang Sun, Qi Wei, Lei Feng, Yupeng Hu, Fan Liu, Hehe Fan, Yilong Yin
{"title":"Variational Rectification Inference for Learning with Noisy Labels","authors":"Haoliang Sun, Qi Wei, Lei Feng, Yupeng Hu, Fan Liu, Hehe Fan, Yilong Yin","doi":"10.1007/s11263-024-02205-5","DOIUrl":"https://doi.org/10.1007/s11263-024-02205-5","url":null,"abstract":"<p>Label noise has been broadly observed in real-world datasets. To mitigate the negative impact of overfitting to label noise for deep models, effective strategies (e.g., re-weighting, or loss rectification) have been broadly applied in prevailing approaches, which have been generally learned under the meta-learning scenario. Despite the robustness of noise achieved by the probabilistic meta-learning models, they usually suffer from model collapse that degenerates generalization performance. In this paper, we propose variational rectification inference (VRI) to formulate the adaptive rectification for loss functions as an amortized variational inference problem and derive the evidence lower bound under the meta-learning framework. Specifically, VRI is constructed as a hierarchical Bayes by treating the rectifying vector as a latent variable, which can rectify the loss of the noisy sample with the extra randomness regularization and is, therefore, more robust to label noise. To achieve the inference of the rectifying vector, we approximate its conditional posterior with an amortization meta-network. By introducing the variational term in VRI, the conditional posterior is estimated accurately and avoids collapsing to a Dirac delta function, which can significantly improve the generalization performance. The elaborated meta-network and prior network adhere to the smoothness assumption, enabling the generation of reliable rectification vectors. Given a set of clean meta-data, VRI can be efficiently meta-learned within the bi-level optimization programming. Besides, theoretical analysis guarantees that the meta-network can be efficiently learned with our algorithm. Comprehensive comparison experiments and analyses validate its effectiveness for robust learning with noisy labels, particularly in the presence of open-set noise.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"44 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2024-08-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141918878","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
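The core of the rectification step is to treat the rectifying vector as a latent variable: a meta-network amortizes its posterior, a reparameterized sample rectifies each per-sample loss, and a KL term to a standard normal prior regularizes the inference. The sketch below shows that mechanic in isolation; RectifierSketch is a hypothetical module, and the actual VRI meta-network, bi-level optimization, and ELBO derivation follow the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RectifierSketch(nn.Module):
    """Amortized inference of a rectifying vector that reweights per-sample losses (illustrative)."""
    def __init__(self, in_dim=2, z_dim=8):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 32), nn.ReLU(), nn.Linear(32, 2 * z_dim))
        self.to_weight = nn.Linear(z_dim, 1)

    def forward(self, loss_feat):
        mu, logvar = self.encoder(loss_feat).chunk(2, dim=1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()          # reparameterized rectifying vector
        weight = torch.sigmoid(self.to_weight(z)).squeeze(1)          # per-sample rectification weight
        kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1).sum(dim=1).mean()   # KL(q(z|x) || N(0, I))
        return weight, kl

# usage: rectify a batch of per-sample cross-entropy losses on possibly noisy labels
rect = RectifierSketch()
logits = torch.randn(16, 10)
labels = torch.randint(0, 10, (16,))
per_sample_loss = F.cross_entropy(logits, labels, reduction="none")
entropy = -(logits.softmax(1) * logits.log_softmax(1)).sum(1)
feat = torch.stack([per_sample_loss, entropy], dim=1)                 # toy per-sample training signal
w, kl = rect(feat)
total = (w * per_sample_loss).mean() + 1e-3 * kl
total.backward()
```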