Modal-adversarial Semantic Learning Network for Extendable Cross-modal Retrieval

Xing Xu, Jingkuan Song, Huimin Lu, Yang Yang, Fumin Shen, Zi Huang
{"title":"Modal-adversarial Semantic Learning Network for Extendable Cross-modal Retrieval","authors":"Xing Xu, Jingkuan Song, Huimin Lu, Yang Yang, Fumin Shen, Zi Huang","doi":"10.1145/3206025.3206033","DOIUrl":null,"url":null,"abstract":"Cross-modal retrieval, e.g., using an image query to search related text and vice-versa, has become a highlighted research topic, to provide flexible retrieval experience across multi-modal data. Existing approaches usually consider the so-called non-extendable cross-modal retrieval task. In this task, they learn a common latent subspace from a source set containing labeled instances of image-text pairs and then generate common representation for the instances in a target set to perform cross-modal matching. However, these method may not generalize well when the instances of the target set contains unseen classes since the instances of both the source and target set are assumed to share the same range of classes in the non-extensive cross-modal retrieval task. In this paper, we consider a more practical issue of extendable cross-modal retrieval task where instances in source and target set have disjoint classes. We propose a novel framework, termed Modal-adversarial Semantic Learning Network (MASLN), to tackle the limitation of existing methods on this practical task. Specifically, the proposed MASLN consists two subnetworks of cross-modal reconstruction and modal-adversarial semantic learning. The former minimizes the cross-modal distribution discrepancy by reconstructing each modality data mutually, with the guidelines of class embeddings as side information in the reconstruction procedure. The latter generates semantic representation to be indiscriminative for modalities, while to distinguish the modalities from the common representation via an adversarial learning mechanism. The two subnetworks are jointly trained to enhance the cross-modal semantic consistency in the learned common subspace and the knowledge transfer to instances in the target set. Comprehensive experiment on three widely-used multi-modal datasets show its effectiveness and robustness on both non-extendable and extendable cross-modal retrieval task.","PeriodicalId":224132,"journal":{"name":"Proceedings of the 2018 ACM on International Conference on Multimedia Retrieval","volume":"179 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-06-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"32","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2018 ACM on International Conference on Multimedia Retrieval","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3206025.3206033","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 32

Abstract

Cross-modal retrieval, e.g., using an image query to search for related text and vice versa, has become a prominent research topic, as it provides a flexible retrieval experience across multi-modal data. Existing approaches usually consider the so-called non-extendable cross-modal retrieval task. In this task, they learn a common latent subspace from a source set containing labeled instances of image-text pairs and then generate common representations for the instances in a target set to perform cross-modal matching. However, these methods may not generalize well when the target set contains unseen classes, since the instances of the source and target sets are assumed to share the same range of classes in the non-extendable cross-modal retrieval task. In this paper, we consider the more practical extendable cross-modal retrieval task, where instances in the source and target sets have disjoint classes. We propose a novel framework, termed Modal-adversarial Semantic Learning Network (MASLN), to tackle the limitations of existing methods on this practical task. Specifically, the proposed MASLN consists of two subnetworks: cross-modal reconstruction and modal-adversarial semantic learning. The former minimizes the cross-modal distribution discrepancy by mutually reconstructing each modality's data, with class embeddings serving as side information to guide the reconstruction procedure. The latter generates semantic representations that are indiscriminative with respect to modality, while a discriminator attempts to distinguish the modalities from the common representation via an adversarial learning mechanism. The two subnetworks are jointly trained to enhance the cross-modal semantic consistency in the learned common subspace and the knowledge transfer to instances in the target set. Comprehensive experiments on three widely used multi-modal datasets show the effectiveness and robustness of MASLN on both the non-extendable and extendable cross-modal retrieval tasks.
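The abstract describes the two-subnetwork design only at a high level. As a rough illustration of how such an architecture could be wired together, the sketch below shows modality encoders into a common subspace, mutual cross-modal decoders conditioned on class embeddings, and a modality discriminator trained adversarially. It is a minimal sketch under stated assumptions: the layer sizes, the gradient-reversal layer used for the adversarial branch, and all dimensions are illustrative choices, not details taken from the paper.

```python
# Minimal sketch of a MASLN-style model in PyTorch (illustrative only).
# Dimensions, the gradient-reversal trick, and the layer layout are
# assumptions for this sketch; they are not specified by the abstract.
import torch
import torch.nn as nn


class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; reverses (and scales) gradients in the
    backward pass, so the encoders are pushed to produce modality-
    indiscriminative codes while the discriminator tries to tell them apart."""

    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None


class MASLNSketch(nn.Module):
    def __init__(self, img_dim=4096, txt_dim=300, cls_dim=200, common_dim=512):
        super().__init__()
        # Encoders project each modality into a common semantic subspace.
        self.img_enc = nn.Sequential(nn.Linear(img_dim, common_dim), nn.ReLU())
        self.txt_enc = nn.Sequential(nn.Linear(txt_dim, common_dim), nn.ReLU())
        # Decoders reconstruct each modality from the *other* modality's code,
        # concatenated with a class embedding as side information.
        self.img_dec = nn.Linear(common_dim + cls_dim, img_dim)
        self.txt_dec = nn.Linear(common_dim + cls_dim, txt_dim)
        # Modality discriminator: 2-way classifier (image code vs. text code).
        self.modality_disc = nn.Sequential(
            nn.Linear(common_dim, 128), nn.ReLU(), nn.Linear(128, 2))

    def forward(self, img, txt, cls_emb, lam=1.0):
        z_img, z_txt = self.img_enc(img), self.txt_enc(txt)
        # Cross-modal (mutual) reconstruction guided by class embeddings.
        rec_txt = self.txt_dec(torch.cat([z_img, cls_emb], dim=1))
        rec_img = self.img_dec(torch.cat([z_txt, cls_emb], dim=1))
        # Adversarial branch: discriminator sees gradient-reversed codes.
        d_img = self.modality_disc(GradReverse.apply(z_img, lam))
        d_txt = self.modality_disc(GradReverse.apply(z_txt, lam))
        return z_img, z_txt, rec_img, rec_txt, d_img, d_txt
```

A training loop for such a sketch would combine reconstruction losses on rec_img and rec_txt, a semantic loss on the common codes, and cross-entropy for the modality discriminator; the gradient reversal lets a single optimizer update the whole model while the encoders and discriminator compete, which is one common way to realize the adversarial mechanism the abstract describes.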