Multimodal Fusion with Co-attention Mechanism

Pei Li, Xinde Li
{"title":"Multimodal Fusion with Co-attention Mechanism","authors":"Pei Li, Xinde Li","doi":"10.23919/FUSION45008.2020.9190483","DOIUrl":null,"url":null,"abstract":"Because the information from different modalities will complement each other when describing the same contents, multimodal information can be used to obtain better feature representations. Thus, how to represent and fuse the relevant information has become a current research topic. At present, most of the existing feature fusion methods consider the different levels of features representations, but they ignore the significant relevance between the local regions, especially in the high-level semantic representation. In this paper, a general multimodal fusion method based on the co-attention mechanism is proposed, which is similar to the transformer structure. We discuss two main issues: (1) Improving the applicability and generality of the transformer to different modal data; (2) By capturing and transmitting the relevant information between local features before fusion, the proposed method can allow for more robustness. We evaluate our model on the multimodal classification task, and the experiments demonstrate that our model can learn fused featnre representation effectively.","PeriodicalId":419881,"journal":{"name":"2020 IEEE 23rd International Conference on Information Fusion (FUSION)","volume":"42 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"6","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2020 IEEE 23rd International Conference on Information Fusion (FUSION)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.23919/FUSION45008.2020.9190483","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 6

Abstract

Because information from different modalities complements each other when describing the same content, multimodal information can be used to obtain better feature representations. How to represent and fuse the relevant information has therefore become an active research topic. Most existing feature fusion methods consider different levels of feature representation but ignore the significant relevance between local regions, especially in high-level semantic representations. In this paper, a general multimodal fusion method based on the co-attention mechanism, similar in structure to the transformer, is proposed. We discuss two main issues: (1) improving the applicability and generality of the transformer to data from different modalities; (2) capturing and transmitting relevant information between local features before fusion, which makes the proposed method more robust. We evaluate our model on a multimodal classification task, and the experiments demonstrate that our model can learn fused feature representations effectively.
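To make the idea concrete, below is a minimal sketch of a transformer-style co-attention fusion block, in the spirit of the abstract: each modality attends to the other's local features before the two streams are fused. The class name, layer sizes, residual/LayerNorm placement, and the concatenation-based fusion step are illustrative assumptions, not the authors' exact architecture.

```python
# Minimal co-attention fusion sketch (assumed details, not the paper's exact model).
import torch
import torch.nn as nn


class CoAttentionFusion(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        # Each modality attends to the other's local features (co-attention).
        self.attn_a_to_b = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.attn_b_to_a = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_a = nn.LayerNorm(dim)
        self.norm_b = nn.LayerNorm(dim)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
        # feat_a: (batch, len_a, dim), feat_b: (batch, len_b, dim).
        # Queries come from one modality, keys/values from the other, so
        # relevant local information is exchanged before fusion.
        a_attended, _ = self.attn_a_to_b(feat_a, feat_b, feat_b)
        b_attended, _ = self.attn_b_to_a(feat_b, feat_a, feat_a)
        a_out = self.norm_a(feat_a + a_attended)
        b_out = self.norm_b(feat_b + b_attended)
        # Pool over the sequence dimension and fuse by concatenation.
        fused = torch.cat([a_out.mean(dim=1), b_out.mean(dim=1)], dim=-1)
        return self.fuse(fused)


if __name__ == "__main__":
    # Example: fuse image-region features with text-token features.
    image_regions = torch.randn(8, 36, 256)
    text_tokens = torch.randn(8, 20, 256)
    fused = CoAttentionFusion()(image_regions, text_tokens)
    print(fused.shape)  # torch.Size([8, 256])
```

The fused vector could then feed a classification head for the multimodal classification task the paper evaluates on; the mean-pooling and concatenation choices here are only one plausible way to combine the attended streams.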