Pseudo-triplet Guided Few-shot Composed Image Retrieval
Bohan Hou, Haoqiang Lin, Haokun Wen, Meng Liu, Xuemeng Song
arXiv:2407.06001 (arXiv - CS - Multimedia), published 2024-07-08

Composed Image Retrieval (CIR) is a challenging task that aims to retrieve
the target image based on a multimodal query, i.e., a reference image and its
corresponding modification text. Because previous supervised and zero-shot learning paradigms both fail to strike a good trade-off between time-consuming annotation cost and retrieval performance, recent work introduced the task of few-shot CIR (FS-CIR) and proposed a textual inversion-based network built on the pretrained CLIP model to realize it. Despite its promising
performance, the approach suffers from two key limitations: insufficient
multimodal query composition training and indiscriminate training triplet
selection. To address these two limitations, in this work we propose a novel two-stage pseudo-triplet-guided few-shot CIR scheme, dubbed PTG-FSCIR. In the first stage, we employ a masked training strategy and an advanced image caption generator to construct pseudo triplets from pure image data, enabling the model to acquire primary knowledge of multimodal query composition.
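As a rough illustration of this first stage, the sketch below constructs pseudo triplets from unlabeled images: a randomly masked view serves as the reference image, a generated caption of the full image as the pseudo modification text, and the original image as the target. The helper names (`mask_random_patches`, the `caption` callable) are hypothetical stand-ins, not the paper's actual components.

```python
import random
from dataclasses import dataclass
from typing import Callable, List

from PIL import Image

@dataclass
class PseudoTriplet:
    reference: Image.Image   # masked view acts as the reference image
    modification: str        # generated caption acts as the modification text
    target: Image.Image      # original image is the retrieval target

def mask_random_patches(img: Image.Image, ratio: float = 0.5, patch: int = 32) -> Image.Image:
    """Zero out a random subset of patches (a stand-in for the masking strategy; assumes RGB)."""
    out = img.copy()
    w, h = out.size
    for x in range(0, w, patch):
        for y in range(0, h, patch):
            if random.random() < ratio:
                out.paste((0, 0, 0), (x, y, min(x + patch, w), min(y + patch, h)))
    return out

def build_pseudo_triplets(images: List[Image.Image],
                          caption: Callable[[Image.Image], str]) -> List[PseudoTriplet]:
    # The caption of the full image describes what the masked reference is missing,
    # so it can serve as a pseudo modification text.
    return [PseudoTriplet(mask_random_patches(img), caption(img), img) for img in images]
```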
In the second stage, based on active learning, we design a pseudo-modification-text-based query-target distance metric to evaluate a challenge score for each unlabeled sample.
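A minimal sketch of such a score, under stated assumptions: the reference image and pseudo modification text are embedded with CLIP-style encoders (any model exposing `encode_image`/`encode_text`, e.g. open_clip), composed by simple addition (our assumption; the paper's composition network may differ), and the score is the cosine distance between the composed query and the target embedding.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def challenge_score(model, reference_img, pseudo_text_tokens, target_img) -> float:
    """Distance between the composed query and the target: larger = more challenging."""
    ref = F.normalize(model.encode_image(reference_img), dim=-1)
    txt = F.normalize(model.encode_text(pseudo_text_tokens), dim=-1)
    tgt = F.normalize(model.encode_image(target_img), dim=-1)
    query = F.normalize(ref + txt, dim=-1)            # additive composition (an assumption)
    return (1.0 - (query * tgt).sum(dim=-1)).item()   # cosine distance, batch size 1
```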
Meanwhile, we propose a robust top-range-based random sampling strategy, grounded in the statistical 3-$\sigma$ rule, to select challenging samples for fine-tuning the pretrained model.
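One plausible reading of this sampling step, sketched below: compute the mean and standard deviation of the challenge scores, treat scores beyond $\mu + 3\sigma$ as outliers per the 3-$\sigma$ rule, and sample uniformly at random from a high-scoring band below that cutoff. The specific band $[\mu + 2\sigma, \mu + 3\sigma]$ is an illustrative assumption, not the paper's stated boundary.

```python
import random
import statistics
from typing import Dict, List

def top_range_random_sample(scores: Dict[str, float], k: int) -> List[str]:
    """Randomly draw up to k sample IDs whose challenge score falls in an upper band.

    The band [mu + 2*sigma, mu + 3*sigma] is an illustrative choice: it targets
    hard samples while excluding >3-sigma outliers (the 3-sigma rule).
    """
    mu = statistics.mean(scores.values())
    sigma = statistics.pstdev(scores.values())
    band = [sid for sid, s in scores.items() if mu + 2 * sigma <= s <= mu + 3 * sigma]
    return random.sample(band, min(k, len(band)))
```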
Notably, our scheme is plug-and-play and compatible with any existing supervised CIR model. We tested it with three backbones on three public datasets (i.e., FashionIQ, CIRR, and Birds-to-Words), achieving maximum improvements of 26.4%, 25.5%, and 21.6%, respectively, which demonstrates our scheme's effectiveness.