E-InMeMo: Enhanced Prompting for Visual In-Context Learning

Jiahao Zhang, Bowen Wang, Hong Liu, Liangzhi Li, Yuta Nakashima, Hajime Nagahara

Journal of Imaging, 11(7), published 2025-07-11. DOI: 10.3390/jimaging11070232
Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12295390/pdf/
Abstract
Large-scale models trained on extensive datasets have become the standard due to their strong generalizability across diverse tasks. In-context learning (ICL), widely used in natural language processing, leverages these models by providing task-specific prompts without modifying their parameters. This paradigm is increasingly being adapted for computer vision, where models receive an input-output image pair, known as an in-context pair, alongside a query image to illustrate the desired output. However, the success of visual ICL largely hinges on the quality of these prompts. To address this, we propose Enhanced Instruct Me More (E-InMeMo), a novel approach that incorporates learnable perturbations into in-context pairs to optimize prompting. Through extensive experiments on standard vision tasks, E-InMeMo demonstrates superior performance over existing state-of-the-art methods. Notably, it improves mIoU scores by 7.99 points for foreground segmentation and by 17.04 points for single object detection when compared to the baseline without learnable prompts. These results highlight E-InMeMo as a lightweight yet effective strategy for enhancing visual ICL.
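To make the idea concrete, the following is a minimal, hypothetical sketch of the mechanism the abstract describes, not the authors' implementation: a learnable perturbation is added to the in-context pair only, and the perturbed pair is composed with the query image into a single visual prompt that a frozen vision model completes. The class name, the shared single perturbation, the [0, 1] pixel range, and the 2x2 canvas layout are all assumptions for illustration.

```python
# Hypothetical sketch of learnable prompting for visual ICL (not the paper's code).
import torch
import torch.nn as nn


class LearnablePrompt(nn.Module):
    """Adds a learnable perturbation to the in-context (input, output) images."""

    def __init__(self, image_size: int = 224):
        super().__init__()
        # A single shared perturbation for the in-context pair, initialized to zero.
        # Only this tensor is optimized; the large vision model stays frozen.
        self.delta = nn.Parameter(torch.zeros(3, image_size, image_size))

    def forward(self, context_in: torch.Tensor, context_out: torch.Tensor,
                query: torch.Tensor) -> torch.Tensor:
        # Perturb only the in-context pair; the query image is left unchanged.
        ctx_in = (context_in + self.delta).clamp(0, 1)
        ctx_out = (context_out + self.delta).clamp(0, 1)
        # Compose a 2x2 canvas (assumed layout):
        #   [ context input | context output ]
        #   [ query         | blank region to be predicted ]
        blank = torch.zeros_like(query)
        top = torch.cat([ctx_in, ctx_out], dim=-1)
        bottom = torch.cat([query, blank], dim=-1)
        return torch.cat([top, bottom], dim=-2)


# Usage: build the visual prompt; gradients from the downstream task loss
# would flow back into `delta` while the backbone's weights are untouched.
prompt = LearnablePrompt()
canvas = prompt(torch.rand(3, 224, 224), torch.rand(3, 224, 224), torch.rand(3, 224, 224))
print(canvas.shape)  # torch.Size([3, 448, 448])
```

Under these assumptions, the prompt-learning cost is just the perturbation tensor, which is consistent with the abstract's characterization of E-InMeMo as a lightweight strategy.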