Exploring Effective Factors for Improving Visual In-Context Learning

IEEE transactions on image processing : a publication of the IEEE Signal Processing Society Pub Date : 2025-03-31 DOI:10.1109/TIP.2025.3554410

Yanpeng Sun;Qiang Chen;Jian Wang;Jingdong Wang;Zechao Li

{"title":"Exploring Effective Factors for Improving Visual In-Context Learning","authors":"Yanpeng Sun;Qiang Chen;Jian Wang;Jingdong Wang;Zechao Li","doi":"10.1109/TIP.2025.3554410","DOIUrl":null,"url":null,"abstract":"The In-Context Learning (ICL) is to understand a new task via a few demonstrations (aka. prompt) and predict new inputs without tuning the models. While it has been widely studied in NLP, it is still a relatively new area of research in computer vision. To reveal the factors influencing the performance of visual in-context learning, this paper shows that Prompt Selection and Prompt Fusion are two major factors that have a direct impact on the inference performance of visual in-context learning. Prompt selection is the process of selecting the most suitable prompt for query image. This is crucial because high-quality prompts assist large-scale visual models in rapidly and accurately comprehending new tasks. Prompt fusion involves combining prompts and query images to activate knowledge within large-scale visual models. However, altering the prompt fusion method significantly impacts its performance on new tasks. Based on these findings, we propose a simple framework prompt-SelF to improve visual in-context learning. Specifically, we first use the pixel-level retrieval method to select a suitable prompt, and then use different prompt fusion methods to activate diverse knowledge stored in the large-scale vision model, and finally, ensemble the prediction results obtained from different prompt fusion methods to obtain the final prediction results. We conducted extensive experiments on single-object segmentation and detection tasks to demonstrate the effectiveness of prompt-SelF. Remarkably, prompt-SelF has outperformed OSLSM method-based meta-learning in 1-shot segmentation for the first time. This indicated the great potential of visual in-context learning. The source code and models will be available at <uri>https://github.com/syp2ysy/prompt-SelF</uri>.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"2147-2160"},"PeriodicalIF":0.0000,"publicationDate":"2025-03-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/10945944/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 0

Abstract

The In-Context Learning (ICL) is to understand a new task via a few demonstrations (aka. prompt) and predict new inputs without tuning the models. While it has been widely studied in NLP, it is still a relatively new area of research in computer vision. To reveal the factors influencing the performance of visual in-context learning, this paper shows that Prompt Selection and Prompt Fusion are two major factors that have a direct impact on the inference performance of visual in-context learning. Prompt selection is the process of selecting the most suitable prompt for query image. This is crucial because high-quality prompts assist large-scale visual models in rapidly and accurately comprehending new tasks. Prompt fusion involves combining prompts and query images to activate knowledge within large-scale visual models. However, altering the prompt fusion method significantly impacts its performance on new tasks. Based on these findings, we propose a simple framework prompt-SelF to improve visual in-context learning. Specifically, we first use the pixel-level retrieval method to select a suitable prompt, and then use different prompt fusion methods to activate diverse knowledge stored in the large-scale vision model, and finally, ensemble the prediction results obtained from different prompt fusion methods to obtain the final prediction results. We conducted extensive experiments on single-object segmentation and detection tasks to demonstrate the effectiveness of prompt-SelF. Remarkably, prompt-SelF has outperformed OSLSM method-based meta-learning in 1-shot segmentation for the first time. This indicated the great potential of visual in-context learning. The source code and models will be available at https://github.com/syp2ysy/prompt-SelF.

查看原文本刊更多论文

探索提高视觉情境学习的有效因素

情境学习（ICL）是通过一些演示来理解一个新任务。提示)并预测新的输入，而无需调整模型。虽然它在自然语言处理中得到了广泛的研究，但它在计算机视觉中仍然是一个相对较新的研究领域。为了揭示影响视觉语境学习表现的因素，本文表明提示选择和提示融合是直接影响视觉语境学习推理表现的两个主要因素。提示选择是为查询图像选择最合适的提示的过程。这是至关重要的，因为高质量的提示有助于大规模视觉模型快速准确地理解新任务。提示融合包括将提示和查询图像结合起来，在大规模视觉模型中激活知识。然而，改变提示融合方法会显著影响其在新任务中的性能。基于这些发现，我们提出了一个简单的提示自我框架来提高视觉语境学习。具体而言，我们首先使用像素级检索方法选择合适的提示，然后使用不同的提示融合方法激活存储在大尺度视觉模型中的多种知识，最后将不同提示融合方法获得的预测结果进行集成，得到最终的预测结果。我们在单目标分割和检测任务上进行了大量的实验，以证明prompt-SelF的有效性。值得注意的是，prompt-SelF在单次分割中首次优于基于OSLSM方法的元学习。这表明了视觉情境学习的巨大潜力。源代码和模型可以在https://github.com/syp2ysy/prompt-SelF上获得。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

IEEE transactions on image processing : a publication of the IEEE Signal Processing Society

自引率

0.00%

发文量