DiMBERT: Learning Vision-Language Grounded Representations with Disentangled Multimodal-Attention

ACM Transactions on Knowledge Discovery from Data (TKDD) Pub Date : 2021-07-03 DOI:10.1145/3447685

Fenglin Liu, Xuancheng Ren


Citations: 10

Abstract

Vision-and-language (V-L) tasks require a system to understand both visual content and natural language, so learning fine-grained joint representations of vision and language (a.k.a. V-L representations) is of paramount importance. Recently, various pre-trained V-L models have been proposed to learn V-L representations, and they have achieved improved results on many tasks. However, mainstream models process both vision and language inputs with the same set of attention matrices. As a result, the generated V-L representations are entangled in one common latent space. To tackle this problem, we propose DiMBERT (short for Disentangled Multimodal-Attention BERT), a novel framework that applies separate attention spaces to vision and language, so that the representations of the different modalities can be disentangled explicitly. To enhance the correlation between vision and language in the disentangled spaces, we introduce visual concepts to DiMBERT, which represent visual information in textual form. In this manner, visual concepts help to bridge the gap between the two modalities. We pre-train DiMBERT on a large number of image-sentence pairs with two tasks: bidirectional language modeling and sequence-to-sequence language modeling. After pre-training, DiMBERT is fine-tuned for downstream tasks. Experiments show that DiMBERT sets a new state of the art on three tasks (over four datasets), including both generation tasks (image captioning and visual storytelling) and classification tasks (referring expressions). The proposed DiM (short for Disentangled Multimodal-Attention) module can be easily incorporated into existing pre-trained V-L models to boost their performance, with up to a 5% increase on the representative task. Finally, we conduct a systematic analysis and demonstrate the effectiveness of DiM and the introduced visual concepts.
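
The core idea of separate attention spaces can be illustrated with a small sketch. The abstract does not give the exact formulation, so the following is only an assumption-laden toy: it assumes that "separated attention spaces" means each modality has its own query, key, and value projection matrices, and it uses a single attention head with no masking. All names (`project`, `disentangled_attention`, `params`) are hypothetical and are not taken from the paper.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def project(tokens, modalities, weights):
    """Project each token with the matrix that belongs to its own modality.

    Assumption: this per-modality lookup is what disentangles the spaces.
    A conventional shared-attention model would use one matrix for all tokens.
    """
    return [matmul([tok], weights[mod])[0] for tok, mod in zip(tokens, modalities)]

def disentangled_attention(tokens, modalities, params):
    """Single-head scaled dot-product attention with per-modality Q/K/V."""
    q = project(tokens, modalities, params["q"])
    k = project(tokens, modalities, params["k"])
    v = project(tokens, modalities, params["v"])
    scale = math.sqrt(len(q[0]))
    outputs = []
    for qi in q:
        # Every token still attends over the full mixed sequence.
        scores = [sum(a * b for a, b in zip(qi, kj)) / scale for kj in k]
        weights = softmax(scores)
        outputs.append(
            [sum(w * vj[d] for w, vj in zip(weights, v)) for d in range(len(v[0]))]
        )
    return outputs

# Toy input: two image-region features followed by three word embeddings.
rng = random.Random(0)
dim = 4
tokens = rand_matrix(5, dim, rng)
modalities = ["vision", "vision", "language", "language", "language"]
params = {
    name: {mod: rand_matrix(dim, dim, rng) for mod in ("vision", "language")}
    for name in ("q", "k", "v")
}
out = disentangled_attention(tokens, modalities, params)
```

If the `vision` and `language` entries of `params` point to the same matrices, the sketch reduces to ordinary shared attention, which corresponds to the entangled setting the abstract argues against.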

