Amaia Cardiel, Eloi Zablocki, Oriane Siméoni, Elias Ramzi, Matthieu Cord
{"title":"LLM-wrapper:视觉语言基础模型的黑盒语义感知适配","authors":"Amaia Cardiel, Eloi Zablocki, Oriane Siméoni, Elias Ramzi, Matthieu Cord","doi":"arxiv-2409.11919","DOIUrl":null,"url":null,"abstract":"Vision Language Models (VLMs) have shown impressive performances on numerous\ntasks but their zero-shot capabilities can be limited compared to dedicated or\nfine-tuned models. Yet, fine-tuning VLMs comes with limitations as it requires\n`white-box' access to the model's architecture and weights as well as expertise\nto design the fine-tuning objectives and optimize the hyper-parameters, which\nare specific to each VLM and downstream task. In this work, we propose\nLLM-wrapper, a novel approach to adapt VLMs in a `black-box' manner by\nleveraging large language models (LLMs) so as to reason on their outputs. We\ndemonstrate the effectiveness of LLM-wrapper on Referring Expression\nComprehension (REC), a challenging open-vocabulary task that requires spatial\nand semantic reasoning. Our approach significantly boosts the performance of\noff-the-shelf models, resulting in competitive results when compared with\nclassic fine-tuning.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"LLM-wrapper: Black-Box Semantic-Aware Adaptation of Vision-Language Foundation Models\",\"authors\":\"Amaia Cardiel, Eloi Zablocki, Oriane Siméoni, Elias Ramzi, Matthieu Cord\",\"doi\":\"arxiv-2409.11919\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Vision Language Models (VLMs) have shown impressive performances on numerous\\ntasks but their zero-shot capabilities can be limited compared to dedicated or\\nfine-tuned models. Yet, fine-tuning VLMs comes with limitations as it requires\\n`white-box' access to the model's architecture and weights as well as expertise\\nto design the fine-tuning objectives and optimize the hyper-parameters, which\\nare specific to each VLM and downstream task. In this work, we propose\\nLLM-wrapper, a novel approach to adapt VLMs in a `black-box' manner by\\nleveraging large language models (LLMs) so as to reason on their outputs. We\\ndemonstrate the effectiveness of LLM-wrapper on Referring Expression\\nComprehension (REC), a challenging open-vocabulary task that requires spatial\\nand semantic reasoning. 
Our approach significantly boosts the performance of\\noff-the-shelf models, resulting in competitive results when compared with\\nclassic fine-tuning.\",\"PeriodicalId\":501130,\"journal\":{\"name\":\"arXiv - CS - Computer Vision and Pattern Recognition\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-09-18\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Computer Vision and Pattern Recognition\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2409.11919\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Computer Vision and Pattern Recognition","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.11919","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
LLM-wrapper: Black-Box Semantic-Aware Adaptation of Vision-Language Foundation Models
Vision Language Models (VLMs) have shown impressive performance on numerous tasks, but their zero-shot capabilities can be limited compared to dedicated or fine-tuned models. Yet, fine-tuning VLMs comes with limitations of its own: it requires 'white-box' access to the model's architecture and weights, as well as the expertise to design fine-tuning objectives and optimize hyper-parameters, which are specific to each VLM and downstream task. In this work, we propose LLM-wrapper, a novel approach for adapting VLMs in a 'black-box' manner by leveraging large language models (LLMs) to reason about their outputs. We demonstrate the effectiveness of LLM-wrapper on Referring Expression Comprehension (REC), a challenging open-vocabulary task that requires spatial and semantic reasoning. Our approach significantly boosts the performance of off-the-shelf models, yielding results competitive with classic fine-tuning.
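The abstract describes LLM-wrapper only at a high level. As an illustration of the black-box idea, below is a minimal, hypothetical Python sketch: a VLM's outputs (candidate detections) are serialized to text, and an LLM is prompted to pick the detection matching the referring expression. The `Box`, `boxes_to_text`, `select_box`, and `query_llm` names are assumptions made for this sketch, not the paper's actual interface.

```python
# Hypothetical sketch of black-box adaptation via an LLM reasoning on VLM outputs.
# `query_llm` stands in for whatever black-box LLM backend is available;
# it is NOT part of the paper's code.

from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Box:
    label: str                      # phrase or category predicted by the VLM
    score: float                    # VLM confidence
    xyxy: Tuple[int, int, int, int] # (x1, y1, x2, y2) in pixels


def boxes_to_text(boxes: List[Box]) -> str:
    """Serialize the VLM's detections into a textual list the LLM can reason over."""
    return "\n".join(
        f"[{i}] label={b.label!r}, score={b.score:.2f}, box={b.xyxy}"
        for i, b in enumerate(boxes)
    )


def select_box(expression: str, boxes: List[Box], query_llm: Callable[[str], str]) -> Box:
    """Ask the LLM which candidate detection best matches the referring expression."""
    prompt = (
        "You are given candidate detections from an object detector.\n"
        f"Detections:\n{boxes_to_text(boxes)}\n\n"
        f"Referring expression: {expression!r}\n"
        "Using spatial and semantic reasoning, answer with the index of the single "
        "detection that best matches the expression. Answer with the index only."
    )
    answer = query_llm(prompt)      # black-box LLM call (hypothetical)
    return boxes[int(answer.strip())]
```

In this setting, the detections could come from any off-the-shelf open-vocabulary detector and the LLM from any available API; no access to either model's weights is required, which is the point of the black-box adaptation described in the abstract.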