LLM-wrapper: Black-Box Semantic-Aware Adaptation of Vision-Language Foundation Models
Amaia Cardiel, Eloi Zablocki, Oriane Siméoni, Elias Ramzi, Matthieu Cord
arXiv:2409.11919 (cs.CV), published 2024-09-18
Abstract
Vision Language Models (VLMs) have shown impressive performance on numerous tasks, but their zero-shot capabilities can be limited compared to dedicated or fine-tuned models. Yet, fine-tuning VLMs comes with limitations: it requires `white-box' access to the model's architecture and weights, as well as expertise to design the fine-tuning objectives and optimize the hyper-parameters, which are specific to each VLM and downstream task. In this work, we propose LLM-wrapper, a novel approach to adapt VLMs in a `black-box' manner by leveraging large language models (LLMs) to reason over the VLMs' outputs. We demonstrate the effectiveness of LLM-wrapper on Referring Expression Comprehension (REC), a challenging open-vocabulary task that requires spatial and semantic reasoning. Our approach significantly boosts the performance of off-the-shelf models, yielding results that are competitive with classic fine-tuning.
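
To make the black-box idea concrete, below is a minimal sketch of the inference-time reasoning step the abstract describes: an LLM selects, among the raw detections produced by a frozen VLM, the box that best matches a referring expression. All names (`Box`, `query_llm`, the prompt format) are hypothetical placeholders for illustration, not the authors' actual interfaces, and the abstract does not specify how the LLM itself is adapted.

```python
# Sketch, under assumptions: a black-box VLM has already produced candidate
# detections, and `query_llm` is any text-in / text-out LLM callable.
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Box:
    label: str                       # phrase the VLM associated with this box
    score: float                     # VLM confidence
    xyxy: Tuple[int, int, int, int]  # (x1, y1, x2, y2) pixel coordinates


def select_box_with_llm(
    expression: str,
    candidates: List[Box],
    query_llm: Callable[[str], str],
) -> Box:
    """Ask an LLM to pick the candidate box that best matches the expression."""
    listing = "\n".join(
        f"{i}: label='{b.label}', score={b.score:.2f}, box={b.xyxy}"
        for i, b in enumerate(candidates)
    )
    prompt = (
        "You are given detections from a vision-language model and a referring "
        "expression. Answer with the index of the single detection that best "
        f"matches the expression.\n\nDetections:\n{listing}\n\n"
        f"Expression: {expression}\nIndex:"
    )
    answer = query_llm(prompt)
    # Fall back to the highest-scoring detection if the reply is not a valid index.
    try:
        idx = int(answer.strip().split()[0])
        return candidates[idx]
    except (ValueError, IndexError):
        return max(candidates, key=lambda b: b.score)
```

The design choice captured here is that the VLM is never modified or introspected; only its textual and geometric outputs are serialized into a prompt, so the same wrapper can sit on top of any detector-style VLM.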