Cross-modal adapter for vision–language retrieval

IF 7.6 | Zone 1: Computer Science | JCR Q1: COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Pattern Recognition | Pub Date: 2024-11-03 | DOI: 10.1016/j.patcog.2024.111144

Haojun Jiang, Jianke Zhang, Rui Huang, Chunjiang Ge, Zanlin Ni, Shiji Song, Gao Huang

Volume 159, Article 111144. Full text: https://www.sciencedirect.com/science/article/pii/S0031320324008951

Citations: 0

Abstract

Vision–language retrieval is an important multi-modal learning topic, where the goal is to retrieve the most relevant visual candidate for a given text query. Recently, pre-trained models such as CLIP have shown great potential on retrieval tasks. However, as pre-trained models scale up, fully fine-tuning them on downstream retrieval datasets carries a high risk of overfitting. Moreover, in practice, it is costly to train and store a large model for each task. To overcome these issues, we present a novel Cross-Modal Adapter for parameter-efficient transfer learning. Inspired by adapter-based methods, we adjust the pre-trained model with a few parameterization layers. However, there are two notable differences. First, our method is designed for the multi-modal domain. Second, it allows encoder-level implicit cross-modal interactions between the vision and language encoders. Although surprisingly simple, our approach has three notable benefits: (1) it eliminates the vast majority of fine-tuned parameters, (2) it saves training time, and (3) it allows all the pre-trained parameters to be fixed, enabling the pre-trained model to be shared across datasets. Extensive experiments demonstrate that, without bells and whistles, our approach outperforms adapter-based methods on image–text retrieval datasets (MSCOCO, Flickr30K) and video–text retrieval datasets (MSR-VTT, DiDeMo, and ActivityNet).
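The abstract does not spell out the adapter's internals. The following is a minimal sketch in plain Python, built on two assumptions made purely for illustration: that the adapter is a standard residual bottleneck, and that the implicit cross-modal interaction comes from sharing the up-projection weights between the two encoders. Neither detail is stated in the abstract. All names and sizes (`BottleneckAdapter`, `DIM`, `BOTTLENECK`, `frozen_layer`) are hypothetical.

```python
import random


def init_matrix(rows, cols, rng, scale=0.02):
    """Small random matrix stored as a list of rows."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Matrix-vector product for list-of-rows matrices."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def count_params(matrix):
    """Number of scalar entries in a list-of-rows matrix."""
    return len(matrix) * len(matrix[0])


class BottleneckAdapter:
    """Residual bottleneck adapter: x + up @ relu(down @ x)."""

    def __init__(self, dim, bottleneck, rng, shared_up):
        self.down = init_matrix(bottleneck, dim, rng)  # modality-specific
        self.up = shared_up  # assumption: the same object in both encoders

    def __call__(self, x):
        hidden = [max(0.0, h) for h in matvec(self.down, x)]
        return [xi + di for xi, di in zip(x, matvec(self.up, hidden))]


rng = random.Random(0)
DIM, BOTTLENECK = 64, 4  # hypothetical sizes, not taken from the paper

# Stand-in for one frozen pre-trained layer; it is never updated.
frozen_layer = init_matrix(DIM, DIM, rng)

# Zero-initialised shared up-projection: each adapter starts as the identity.
shared_up = [[0.0] * BOTTLENECK for _ in range(DIM)]
vision_adapter = BottleneckAdapter(DIM, BOTTLENECK, rng, shared_up)
text_adapter = BottleneckAdapter(DIM, BOTTLENECK, rng, shared_up)

features = matvec(frozen_layer, [1.0] * DIM)
adapted = vision_adapter(features)

trainable = (count_params(vision_adapter.down)
             + count_params(text_adapter.down)
             + count_params(shared_up))
frozen = count_params(frozen_layer)
print(f"trainable adapter params: {trainable}, frozen layer params: {frozen}")
```

Because `shared_up` is a single object, any update made to it while training one encoder would also change the other encoder's adapter. That is one possible reading of "implicit" interaction. No training loop is shown here, and the parameter counts only illustrate the bookkeeping, not the savings reported in the paper.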


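Source journal

Pattern Recognition (Engineering Technology / Engineering: Electrical & Electronic)

CiteScore: 14.40
Self-citation rate: 16.20%
Articles published: 683
Review time: 5.6 months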

About the journal: The field of Pattern Recognition is both mature and rapidly evolving, playing a crucial role in various related fields such as computer vision, image processing, text analysis, and neural networks. It closely intersects with machine learning and is being applied in emerging areas like biometrics, bioinformatics, multimedia data analysis, and data science. The journal Pattern Recognition, established half a century ago during the early days of computer science, has since grown significantly in scope and influence.