CPT: Colorful Prompt Tuning for pre-trained vision-language models

Yuan Yao, Ao Zhang, Zhengyan Zhang, Zhiyuan Liu, Tat-Seng Chua, Maosong Sun
{"title":"CPT:基于颜色的提示调整,用于预训练的视觉语言模型","authors":"Yuan Yao ,&nbsp;Ao Zhang ,&nbsp;Zhengyan Zhang ,&nbsp;Zhiyuan Liu ,&nbsp;Tat-Seng Chua ,&nbsp;Maosong Sun","doi":"10.1016/j.aiopen.2024.01.004","DOIUrl":null,"url":null,"abstract":"<div><p>Vision-Language Pre-training (VLP) models have shown promising capabilities in grounding natural language in image data, facilitating a broad range of cross-modal tasks. However, we note that there exists a significant gap between the objective forms of model pre-training and fine-tuning, resulting in a need for large amounts of labeled data to stimulate the visual grounding capability of VLP models for downstream tasks. To address the challenge, we present <strong>C</strong>olor-based <strong>P</strong>rompt <strong>T</strong>uning (CPT), a novel paradigm for tuning VLP models, which reformulates visual grounding into a fill-in-the-blank problem with color-based co-referential markers in image and text, maximally mitigating the gap. In this way, CPT enables strong few-shot and even zero-shot visual grounding capabilities of VLP models. Comprehensive experimental results show that CPT achieves state-of-the-art performance on zero/few-shot visual grounding (e.g., 75.1 zero-shot accuracy in RefCOCO evaluation), outperforming fine-tuned and other prompt-tuned models by a large margin. Moreover, CPT can also be easily extended to achieve promising zero/few-shot performance on other vision-language tasks, such as visual relation detection, visual commonsense reasoning and visual question answering. We make the data and codes publicly available at <span>https://github.com/thunlp/CPT</span><svg><path></path></svg>.</p></div>","PeriodicalId":100068,"journal":{"name":"AI Open","volume":"5 ","pages":"Pages 30-38"},"PeriodicalIF":0.0000,"publicationDate":"2024-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S2666651024000056/pdfft?md5=a0b3ea3b64a989f20cbd8db1f84428c6&pid=1-s2.0-S2666651024000056-main.pdf","citationCount":"0","resultStr":"{\"title\":\"CPT: Colorful Prompt Tuning for pre-trained vision-language models\",\"authors\":\"Yuan Yao ,&nbsp;Ao Zhang ,&nbsp;Zhengyan Zhang ,&nbsp;Zhiyuan Liu ,&nbsp;Tat-Seng Chua ,&nbsp;Maosong Sun\",\"doi\":\"10.1016/j.aiopen.2024.01.004\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><p>Vision-Language Pre-training (VLP) models have shown promising capabilities in grounding natural language in image data, facilitating a broad range of cross-modal tasks. However, we note that there exists a significant gap between the objective forms of model pre-training and fine-tuning, resulting in a need for large amounts of labeled data to stimulate the visual grounding capability of VLP models for downstream tasks. To address the challenge, we present <strong>C</strong>olor-based <strong>P</strong>rompt <strong>T</strong>uning (CPT), a novel paradigm for tuning VLP models, which reformulates visual grounding into a fill-in-the-blank problem with color-based co-referential markers in image and text, maximally mitigating the gap. In this way, CPT enables strong few-shot and even zero-shot visual grounding capabilities of VLP models. Comprehensive experimental results show that CPT achieves state-of-the-art performance on zero/few-shot visual grounding (e.g., 75.1 zero-shot accuracy in RefCOCO evaluation), outperforming fine-tuned and other prompt-tuned models by a large margin. 
Moreover, CPT can also be easily extended to achieve promising zero/few-shot performance on other vision-language tasks, such as visual relation detection, visual commonsense reasoning and visual question answering. We make the data and codes publicly available at <span>https://github.com/thunlp/CPT</span><svg><path></path></svg>.</p></div>\",\"PeriodicalId\":100068,\"journal\":{\"name\":\"AI Open\",\"volume\":\"5 \",\"pages\":\"Pages 30-38\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-01-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.sciencedirect.com/science/article/pii/S2666651024000056/pdfft?md5=a0b3ea3b64a989f20cbd8db1f84428c6&pid=1-s2.0-S2666651024000056-main.pdf\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"AI Open\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S2666651024000056\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"AI Open","FirstCategoryId":"1085","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S2666651024000056","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract


Vision-Language Pre-training (VLP) models have shown promising capabilities in grounding natural language in image data, facilitating a broad range of cross-modal tasks. However, we note that there exists a significant gap between the objective forms of model pre-training and fine-tuning, resulting in a need for large amounts of labeled data to stimulate the visual grounding capability of VLP models for downstream tasks. To address the challenge, we present Color-based Prompt Tuning (CPT), a novel paradigm for tuning VLP models, which reformulates visual grounding into a fill-in-the-blank problem with color-based co-referential markers in image and text, maximally mitigating the gap. In this way, CPT enables strong few-shot and even zero-shot visual grounding capabilities of VLP models. Comprehensive experimental results show that CPT achieves state-of-the-art performance on zero/few-shot visual grounding (e.g., 75.1 zero-shot accuracy in RefCOCO evaluation), outperforming fine-tuned and other prompt-tuned models by a large margin. Moreover, CPT can also be easily extended to achieve promising zero/few-shot performance on other vision-language tasks, such as visual relation detection, visual commonsense reasoning and visual question answering. We make the data and codes publicly available at https://github.com/thunlp/CPT.
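
To make the reformulation concrete, the sketch below illustrates one possible way to build the color-based co-referential markers in the image and the fill-in-the-blank query in the text. It is a minimal, assumption-laden sketch rather than the authors' released implementation (see the repository linked above for that): the palette values, blending factor, template wording, and the helper names build_visual_prompt and build_textual_prompt are illustrative choices.

    # Illustrative sketch of CPT-style prompt construction (not the official code;
    # see https://github.com/thunlp/CPT for the authors' implementation).
    from PIL import Image, ImageDraw

    # Hypothetical palette: each candidate region is paired with one color marker.
    PALETTE = {"red": (240, 0, 30), "green": (0, 200, 30), "blue": (40, 80, 240)}

    def build_visual_prompt(image, regions, alpha=0.5):
        """Blend a distinct semi-transparent color block over each candidate region.

        `regions` is a list of (x0, y0, x1, y1) boxes, e.g. proposals from an
        off-the-shelf detector; their order pairs each box with a color name.
        """
        marked = image.convert("RGBA")
        overlay = Image.new("RGBA", marked.size, (0, 0, 0, 0))
        draw = ImageDraw.Draw(overlay)
        color_names = list(PALETTE)[: len(regions)]
        for name, box in zip(color_names, regions):
            r, g, b = PALETTE[name]
            draw.rectangle(box, fill=(r, g, b, int(255 * alpha)))
        return Image.alpha_composite(marked, overlay).convert("RGB"), color_names

    def build_textual_prompt(query, mask_token="[MASK]"):
        """Turn a referring expression into a fill-in-the-blank query whose answer
        is a color word, matching the masked-language-modeling form of pre-training."""
        return f"{query} is in {mask_token} color"

    if __name__ == "__main__":
        img = Image.new("RGB", (640, 480), "white")         # stand-in image
        boxes = [(50, 60, 200, 300), (300, 100, 500, 400)]  # stand-in region proposals
        marked_img, colors = build_visual_prompt(img, boxes)
        prompt = build_textual_prompt("the horse watched by the woman")
        # A VLP model with a masked-language-modeling head would now score each
        # color word in `colors` at the [MASK] position; the highest-scoring
        # color identifies the referred region.
        print(prompt, colors)

Because both the visual markers and the textual blank reuse the vocabulary and objectives seen during pre-training, no new grounding head needs to be learned, which is what enables the zero- and few-shot behavior reported above.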
