Prompt-based Weakly-supervised Vision-language Pre-training

Zixin Guo, Tzu-Jui Julius Wang, Selen Pehlivan, Abduljalil Radman, Min Cao, Jorma Laaksonen

Pattern Recognition Letters, Volume 197, Pages 8-15. Published 2025-07-09. DOI: 10.1016/j.patrec.2025.06.020
Abstract
Weakly-supervised Vision-Language Pre-training (W-VLP) explores methods leveraging weak cross-modal supervision, typically relying on object tags generated from images by a pre-trained object detector (OD). However, training such an OD requires dense cross-modal information, namely images paired with numerous object-level annotations. To alleviate that requirement, this paper addresses W-VLP in two stages: (1) creating data with weaker cross-modal supervision and (2) pre-training a vision-language (VL) model with the created data. The data creation process collects knowledge from large language models (LLMs) to describe images: given an image's category label, descriptions generated by an LLM for that label are used as the language counterpart. This knowledge supplements what an OD can provide, such as the spatial relationships among objects most likely to appear in a scene. To mitigate the noise in the LLM-generated descriptions, which destabilizes training and may lead to overfitting, we incorporate knowledge distillation and external retrieval-augmented knowledge during pre-training. Furthermore, we present an effective VL model pre-trained with the created data. Empirically, despite its weaker cross-modal supervision, our pre-trained VL model notably outperforms other W-VLP works in image and text retrieval, e.g., surpassing VLMixer by 17.7% on MSCOCO and RELIT by 11.25% on Flickr30K in relative Recall@1 on the text-to-image retrieval task. It also shows superior performance on other VL downstream tasks, making a significant stride towards matching the performance of strongly supervised VLP models. The results reveal the effectiveness of the proposed W-VLP methodology.
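To make the two-stage recipe more concrete, below is a minimal sketch of how the data-creation stage could look: an LLM is prompted with an image's category label and asked for scene-level descriptions, which are then paired with the image as weak supervision. The prompt template, the `query_llm` wrapper, and all names here are illustrative assumptions, not the authors' released code.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class WeakPair:
    """An image paired with LLM-generated text used as weak cross-modal supervision."""
    image_path: str
    descriptions: List[str]


# Hypothetical prompt template: ask the LLM to describe a scene containing the
# labelled category, including likely co-occurring objects and their spatial layout.
PROMPT_TEMPLATE = (
    "Describe a photo of a {label}. Mention objects that typically appear "
    "with it and how they are arranged in the scene."
)


def create_weak_pairs(
    labeled_images: List[Tuple[str, str]],   # (image_path, category_label)
    query_llm: Callable[[str], List[str]],   # assumed wrapper around any LLM API
    num_descriptions: int = 3,
) -> List[WeakPair]:
    """Stage 1 (illustrative): turn (image, label) pairs into (image, descriptions) pairs."""
    pairs = []
    for image_path, label in labeled_images:
        prompt = PROMPT_TEMPLATE.format(label=label)
        # Keeping several candidate descriptions helps average out the noise
        # of any single LLM generation.
        descriptions = query_llm(prompt)[:num_descriptions]
        pairs.append(WeakPair(image_path=image_path, descriptions=descriptions))
    return pairs
```

For the pre-training stage, the abstract mentions knowledge distillation to damp the noise of the LLM-generated text. One common way to realize this idea, sketched below under the assumption of a frozen teacher (or retrieved neighbours) providing soft similarity targets, is to blend the teacher's distribution with the one-hot matching targets of an image-text contrastive loss; the paper's exact formulation may differ.

```python
import torch
import torch.nn.functional as F


def distilled_contrastive_loss(img_emb, txt_emb, teacher_sim, alpha=0.4, tau=0.07):
    """Stage 2 (illustrative): image-text contrastive loss softened by a teacher.

    `teacher_sim` is a (B, B) similarity matrix from a frozen teacher; `alpha`
    controls how much the one-hot targets are softened.
    """
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / tau                     # (B, B) student similarities
    hard = torch.eye(logits.size(0), device=logits.device)   # diagonal = matched pairs
    soft = F.softmax(teacher_sim / tau, dim=1)               # teacher's soft targets
    targets = (1 - alpha) * hard + alpha * soft
    loss_i2t = -(targets * F.log_softmax(logits, dim=1)).sum(1).mean()
    loss_t2i = -(targets.t() * F.log_softmax(logits.t(), dim=1)).sum(1).mean()
    return 0.5 * (loss_i2t + loss_t2i)
```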
Journal description:
Pattern Recognition Letters aims at rapid publication of concise articles of a broad interest in pattern recognition.
Subject areas include all the current fields of interest represented by the Technical Committees of the International Association of Pattern Recognition, and other developing themes involving learning and recognition.