Student Can Also be a Good Teacher: Extracting Knowledge from Vision-and-Language Model for Cross-Modal Retrieval

Jun Rao, Tao Qian, Shuhan Qi, Yulin Wu, Qing Liao, Xuan Wang
{"title":"Student Can Also be a Good Teacher: Extracting Knowledge from Vision-and-Language Model for Cross-Modal Retrieval","authors":"Jun Rao, Tao Qian, Shuhan Qi, Yulin Wu, Qing Liao, Xuan Wang","doi":"10.1145/3459637.3482194","DOIUrl":null,"url":null,"abstract":"Astounding results from transformer models with Vision-and Language Pretraining (VLP) on joint vision-and-language downstream tasks have intrigued the multi-modal community. On the one hand, these models are usually so huge that make us more difficult to fine-tune and serve real-time online applications. On the other hand, the compression of the original transformer block will ignore the difference in information between modalities, which leads to the sharp decline of retrieval accuracy. In this work, we present a very light and effective cross-modal retrieval model compression method. With this method, by adopting a novel random replacement strategy and knowledge distillation, our module can learn the knowledge of the teacher with nearly the half number of parameters reduction. Furthermore, our compression method achieves nearly 130x acceleration with acceptable accuracy. To overcome the sharp decline in retrieval tasks because of compression, we introduce the co-attention interaction module to reflect the different information and interaction information. Experiments show that a multi-modal co-attention block is more suitable for cross-modal retrieval tasks rather than the source transformer encoder block.","PeriodicalId":405296,"journal":{"name":"Proceedings of the 30th ACM International Conference on Information & Knowledge Management","volume":"10 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-10-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"8","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 30th ACM International Conference on Information & Knowledge Management","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3459637.3482194","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 8

Abstract

Astounding results from transformer models with Vision-and-Language Pretraining (VLP) on joint vision-and-language downstream tasks have intrigued the multi-modal community. On the one hand, these models are usually so large that they are difficult to fine-tune and to serve in real-time online applications. On the other hand, compressing the original transformer blocks directly ignores the informational differences between modalities, which leads to a sharp decline in retrieval accuracy. In this work, we present a very light and effective compression method for cross-modal retrieval models. By adopting a novel random replacement strategy together with knowledge distillation, our module learns the teacher's knowledge while reducing the parameter count by nearly half. Furthermore, our compression method achieves nearly 130x acceleration with acceptable accuracy. To counter the sharp accuracy drop that compression causes in retrieval tasks, we introduce a co-attention interaction module that captures both modality-specific information and cross-modal interaction information. Experiments show that a multi-modal co-attention block is better suited to cross-modal retrieval tasks than the original transformer encoder block.
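The abstract does not spell out how the random replacement strategy works; a common realization of this idea is module replacement in the style of BERT-of-Theseus, where each frozen teacher block is stochastically swapped for its smaller student counterpart during training. The sketch below assumes that design; the names (`RandomReplacementEncoder`, `p_replace`) are illustrative assumptions, not the paper's actual API.

```python
# A minimal PyTorch sketch of random-replacement compression, assuming a
# BERT-of-Theseus-style setup. Not the paper's actual implementation.
import random

import torch
import torch.nn as nn


class RandomReplacementEncoder(nn.Module):
    """During training, each teacher block is swapped for its smaller student
    counterpart with probability p_replace, so the student blocks learn to
    imitate the teacher's behavior in context."""

    def __init__(self, teacher_blocks, student_blocks, p_replace=0.5):
        super().__init__()
        assert len(teacher_blocks) == len(student_blocks)
        self.teacher_blocks = nn.ModuleList(teacher_blocks)
        self.student_blocks = nn.ModuleList(student_blocks)
        self.p_replace = p_replace
        # Teacher weights stay frozen; only the student blocks are trained.
        for param in self.teacher_blocks.parameters():
            param.requires_grad = False

    def forward(self, hidden):
        # Each layer independently routes through the student or the teacher.
        for teacher, student in zip(self.teacher_blocks, self.student_blocks):
            if self.training and random.random() < self.p_replace:
                hidden = student(hidden)
            else:
                hidden = teacher(hidden)
        return hidden

    @torch.no_grad()
    def compressed_forward(self, hidden):
        # After training converges, serve only the compact student stack.
        for student in self.student_blocks:
            hidden = student(hidden)
        return hidden
```

At serving time only the student stack runs, which is consistent with the abstract's claim of roughly halving the parameter count while retaining the teacher's knowledge.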
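Likewise, the co-attention interaction module is only named, not specified. A typical multi-modal co-attention block consists of two cross-attention streams in which each modality queries the other; the sketch below assumes that layout for illustration and is not the paper's exact block.

```python
# A minimal PyTorch sketch of a multi-modal co-attention block. The
# two-stream cross-attention layout is an assumption for illustration.
import torch
import torch.nn as nn


class CoAttentionBlock(nn.Module):
    """Text tokens attend over image features and vice versa, so each
    modality's representation is conditioned on the other modality."""

    def __init__(self, dim=768, num_heads=8):
        super().__init__()
        self.txt_attends_img = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.img_attends_txt = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_txt = nn.LayerNorm(dim)
        self.norm_img = nn.LayerNorm(dim)

    def forward(self, txt, img):
        # Queries come from one modality; keys/values from the other.
        txt_out, _ = self.txt_attends_img(query=txt, key=img, value=img)
        img_out, _ = self.img_attends_txt(query=img, key=txt, value=txt)
        # Residual connection plus layer norm, per modality.
        return self.norm_txt(txt + txt_out), self.norm_img(img + img_out)
```

Keeping the two modality streams separate is what lets the block "reflect the different information" per modality while the cross-attention supplies the interaction information, in line with the abstract's motivation for replacing the plain transformer encoder block.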