Adversarial Graph Attention Network for Multi-modal Cross-Modal Retrieval
Hongchang Wu, Ziyu Guan, Tao Zhi, Wei Zhao, Cai Xu, Hong Han, Yaming Yang
2019 IEEE International Conference on Big Knowledge (ICBK), November 2019
DOI: 10.1109/ICBK.2019.00043
Abstract: Existing cross-modal retrieval methods are mainly constrained to the bimodal case. When applied to the multi-modal case, they require training O(K^2) separate models (K: number of modalities), which is inefficient and cannot exploit the information shared among multiple modalities. Although some studies have focused on learning a common space of multiple modalities for retrieval, they assumed the data to be i.i.d. and failed to learn the underlying semantic structure, which can be important for retrieval. To tackle this issue, we propose an Adversarial Graph Attention Network for multi-modal cross-modal retrieval (AGAT). AGAT combines a self-attention network (SAT), a graph attention network (GAT) and a multi-modal generative adversarial network (MGAN). The SAT generates high-level embeddings for data items from different modalities, with self-attention capturing feature-level correlations within each modality. The GAT then uses attention to aggregate the embeddings of matched items from different modalities, building a common embedding space. The MGAN aims to "cluster" matched embeddings of different modalities in the common space by forcing each of them to be similar to the aggregation. Finally, we train the common space to capture the semantic structure by constraining within-class/between-class distances. Experiments on three datasets show the effectiveness of AGAT.
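
The abstract outlines a three-part pipeline: per-modality self-attention encoders (SAT), attention-based aggregation of matched items into a common space (GAT), an adversarial objective pulling each modality's embedding toward that aggregation (MGAN), and a within-class/between-class distance constraint. The PyTorch sketch below illustrates one plausible reading of those components; every module name, dimension, pooling choice and loss form here is an assumption made for illustration, not the authors' implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F


class ModalityEncoder(nn.Module):
    # SAT-style encoder (assumed design): self-attention over an item's feature
    # tokens, mean-pooled into one high-level embedding for that modality.
    def __init__(self, feat_dim, embed_dim, num_heads=4):
        super().__init__()
        self.proj = nn.Linear(feat_dim, embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.out = nn.Linear(embed_dim, embed_dim)

    def forward(self, x):                    # x: (batch, tokens, feat_dim)
        h = self.proj(x)
        h, _ = self.attn(h, h, h)            # feature-level self-attention
        return self.out(h.mean(dim=1))       # (batch, embed_dim)


class CrossModalAggregator(nn.Module):
    # GAT-style step (assumed design): attention weights over the K matched
    # modality embeddings of one item, yielding an aggregated common embedding.
    def __init__(self, embed_dim):
        super().__init__()
        self.score = nn.Linear(embed_dim, 1)

    def forward(self, embs):                 # embs: (batch, K, embed_dim)
        w = F.softmax(self.score(embs), dim=1)
        return (w * embs).sum(dim=1)         # (batch, embed_dim)


class ModalityDiscriminator(nn.Module):
    # MGAN-style discriminator (assumed design): predicts which modality an
    # embedding came from (extra class = the aggregation); training the encoders
    # to fool it pushes matched embeddings toward the aggregation.
    def __init__(self, embed_dim, num_modalities):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(embed_dim, embed_dim), nn.ReLU(),
                                 nn.Linear(embed_dim, num_modalities + 1))

    def forward(self, z):
        return self.net(z)                   # logits over modality labels


def structure_loss(common, labels, margin=1.0):
    # Assumed form of the within-class/between-class constraint: pull
    # same-class embeddings together, push different-class pairs apart.
    d = torch.cdist(common, common)
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    return d[same].mean() + F.relu(margin - d[~same]).mean()


# Toy usage with two hypothetical modalities (image regions, text tokens).
enc_img, enc_txt = ModalityEncoder(2048, 256), ModalityEncoder(300, 256)
agg = CrossModalAggregator(256)
img = enc_img(torch.randn(8, 49, 2048))      # 8 items, 49 region features each
txt = enc_txt(torch.randn(8, 20, 300))       # the matched 8 items, 20 word vectors
common = agg(torch.stack([img, txt], dim=1)) # (8, 256) common-space embeddings
loss = structure_loss(common, torch.randint(0, 3, (8,)))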