Learning Joint Multimodal Representation with Adversarial Attention Networks

Proceedings of the 26th ACM International Conference on Multimedia. Pub Date: 2018-10-15. DOI: 10.1145/3240508.3240614

Feiran Huang, Xiaoming Zhang, Zhoujun Li


Citations: 19

Abstract

Recently, learning a joint representation for multimodal data (e.g., data containing both visual content and a text description) has attracted extensive research interest. The features of different modalities are usually correlated and compositive, so a joint representation that captures the correlation is more effective than a subset of the features. Most existing multimodal representation learning methods lack additional constraints that would enhance the robustness of the learned representations. In this paper, a novel model, Adversarial Attention Networks (AAN), is proposed to incorporate both the attention mechanism and adversarial networks for effective and robust multimodal representation learning. Specifically, a visual-semantic attention model with a siamese learning strategy is proposed to encode the fine-grained correlation between the visual and textual modalities. Meanwhile, an adversarial learning model is employed to regularize the generated representation by matching the posterior distribution of the representation to the given priors. The two modules are then incorporated into an integrated learning framework to learn the joint multimodal representation. Experimental results on two tasks, multi-label classification and tag recommendation, show that the proposed model outperforms state-of-the-art representation learning methods.
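The abstract names two ingredients, attention across modalities and adversarial matching of the representation's posterior to a prior, without giving formulas. The following is only a minimal sketch of generic versions of both ideas, not the authors' AAN architecture. It assumes dot-product attention of a text vector over image-region vectors and a standard binary cross-entropy adversarial loss; the function names `attend`, `discriminator_loss`, and `encoder_adversarial_loss` are hypothetical and chosen for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, regions):
    """Assumed attention form: score each region by its dot product with
    the query, normalize with softmax, and return the weighted sum."""
    scores = [sum(q * r for q, r in zip(query, region)) for region in regions]
    weights = softmax(scores)
    dim = len(regions[0])
    pooled = [sum(w * region[i] for w, region in zip(weights, regions))
              for i in range(dim)]
    return weights, pooled


def discriminator_loss(d_prior, d_posterior, eps=1e-12):
    """Binary cross-entropy for a discriminator: its outputs on prior
    samples should approach 1, its outputs on encoded samples 0."""
    real = -sum(math.log(p + eps) for p in d_prior) / len(d_prior)
    fake = -sum(math.log(1.0 - p + eps) for p in d_posterior) / len(d_posterior)
    return real + fake


def encoder_adversarial_loss(d_posterior, eps=1e-12):
    """Regularizer for the encoder: push the discriminator's outputs on
    encoded samples toward 1, so the posterior resembles the prior."""
    return -sum(math.log(p + eps) for p in d_posterior) / len(d_posterior)


# Toy inputs (made up for illustration): one text vector, three image regions.
text_vec = [1.0, 0.0]
image_regions = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
weights, pooled = attend(text_vec, image_regions)
```

In this toy example the first region aligns best with the text vector, so it receives the largest attention weight. In an actual model the discriminator outputs passed to the two loss functions would come from a trained network rather than hand-written probabilities.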



