Pseudo-triplet Guided Few-shot Composed Image Retrieval
Bohan Hou, Haoqiang Lin, Haokun Wen, Meng Liu, Xuemeng Song
arXiv:2407.06001 (arXiv - CS - Multimedia), published 2024-07-08

Composed Image Retrieval (CIR) is a challenging task that aims to retrieve
the target image based on a multimodal query, i.e., a reference image and its
corresponding modification text. Because previous supervised and zero-shot learning paradigms both fail to strike a good trade-off between time-consuming annotation cost and retrieval performance, recent work introduced the task of few-shot CIR (FS-CIR) and proposed a textual inversion-based network built on the pretrained CLIP model to realize it. Despite its promising
performance, the approach suffers from two key limitations: insufficient
multimodal query composition training and indiscriminate training triplet
selection. To address these two limitations, in this work we propose a novel two-stage pseudo-triplet-guided few-shot CIR scheme, dubbed PTG-FSCIR. In the first stage, we employ a masked training strategy and an advanced image caption generator to construct pseudo triplets from pure image data, enabling the model to acquire primary knowledge of multimodal query composition.
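As a rough illustration of this first stage, the sketch below constructs pseudo triplets from unlabeled images: a randomly masked view serves as the reference image, a generated caption of the full image as the pseudo modification text, and the original image as the target. The helper names (`mask_random_patches`, the `caption` callable) are hypothetical stand-ins, not the paper's actual components.

```python
import random
from dataclasses import dataclass
from typing import Callable, List

from PIL import Image

@dataclass
class PseudoTriplet:
    reference: Image.Image   # masked view acts as the reference image
    modification: str        # generated caption acts as the modification text
    target: Image.Image      # original image is the retrieval target

def mask_random_patches(img: Image.Image, ratio: float = 0.5, patch: int = 32) -> Image.Image:
    """Zero out a random subset of patches (a stand-in for the masking strategy; assumes RGB)."""
    out = img.copy()
    w, h = out.size
    for x in range(0, w, patch):
        for y in range(0, h, patch):
            if random.random() < ratio:
                out.paste((0, 0, 0), (x, y, min(x + patch, w), min(y + patch, h)))
    return out

def build_pseudo_triplets(images: List[Image.Image],
                          caption: Callable[[Image.Image], str]) -> List[PseudoTriplet]:
    # The caption of the full image describes what the masked reference is missing,
    # so it can serve as a pseudo modification text.
    return [PseudoTriplet(mask_random_patches(img), caption(img), img) for img in images]
```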
In the second stage, based on active learning, we design a pseudo-modification-text-based query-target distance metric to evaluate a challenge score for each unlabeled sample.
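A minimal sketch of such a score, under stated assumptions: the reference image and pseudo modification text are embedded with CLIP-style encoders (any model exposing `encode_image`/`encode_text`, e.g. open_clip), composed by simple addition (our assumption; the paper's composition network may differ), and the score is the cosine distance between the composed query and the target embedding.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def challenge_score(model, reference_img, pseudo_text_tokens, target_img) -> float:
    """Distance between the composed query and the target: larger = more challenging."""
    ref = F.normalize(model.encode_image(reference_img), dim=-1)
    txt = F.normalize(model.encode_text(pseudo_text_tokens), dim=-1)
    tgt = F.normalize(model.encode_image(target_img), dim=-1)
    query = F.normalize(ref + txt, dim=-1)            # additive composition (an assumption)
    return (1.0 - (query * tgt).sum(dim=-1)).item()   # cosine distance, batch size 1
```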
Meanwhile, we propose a robust top-range-based random sampling strategy, grounded in the statistical 3-$\sigma$ rule, to select challenging samples for fine-tuning the pretrained model.
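One plausible reading of this sampling step, sketched below: compute the mean and standard deviation of the challenge scores, treat scores beyond $\mu + 3\sigma$ as outliers per the 3-$\sigma$ rule, and sample uniformly at random from a high-scoring band below that cutoff. The specific band $[\mu + 2\sigma, \mu + 3\sigma]$ is an illustrative assumption, not the paper's stated boundary.

```python
import random
import statistics
from typing import Dict, List

def top_range_random_sample(scores: Dict[str, float], k: int) -> List[str]:
    """Randomly draw up to k sample IDs whose challenge score falls in an upper band.

    The band [mu + 2*sigma, mu + 3*sigma] is an illustrative choice: it targets
    hard samples while excluding >3-sigma outliers (the 3-sigma rule).
    """
    mu = statistics.mean(scores.values())
    sigma = statistics.pstdev(scores.values())
    band = [sid for sid, s in scores.items() if mu + 2 * sigma <= s <= mu + 3 * sigma]
    return random.sample(band, min(k, len(band)))
```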
Notably, our scheme is plug-and-play and compatible with any existing supervised CIR model. We tested it with three backbones on three public datasets (i.e., FashionIQ, CIRR, and Birds-to-Words), achieving maximum improvements of 26.4%, 25.5%, and 21.6%, respectively, which demonstrates our scheme's effectiveness.