Self-supervised adversarial learning for cross-modal retrieval

Yangchao Wang, Shiyuan He, Xing Xu, Yang Yang, Jingjing Li, Heng Tao Shen
{"title":"Self-supervised adversarial learning for cross-modal retrieval","authors":"Yangchao Wang, Shiyuan He, Xing Xu, Yang Yang, Jingjing Li, Heng Tao Shen","doi":"10.1145/3444685.3446269","DOIUrl":null,"url":null,"abstract":"Cross-modal retrieval aims at enabling flexible retrieval across different modalities. The core of cross-modal retrieval is to learn projections for different modalities and make instances in the learned common subspace comparable to each other. Self-supervised learning automatically creates a supervision signal by transformation of input data and learns semantic features by training to predict the artificial labels. In this paper, we proposed a novel method named Self-Supervised Adversarial Learning (SSAL) for Cross-Modal Retrieval, which deploys self-supervised learning and adversarial learning to seek an effective common subspace. A feature projector tries to generate modality-invariant representations in the common subspace that can confuse an adversarial discriminator consists of two classifiers. One of the classifiers aims to predict rotation angle from image representations, while the other classifier tries to discriminate between different modalities from the learned embeddings. By confusing the self-supervised adversarial model, feature projector filters out the abundant high-level visual semantics and learns image embeddings that are better aligned with text modality in the common subspace. Through the joint exploitation of the above, an effective common subspace is learned, in which representations of different modlities are aligned better and common information of different modalities is well preserved. Comprehensive experimental results on three widely-used benchmark datasets show that the proposed method is superior in cross-modal retrieval and significantly outperforms the existing cross-modal retrieval methods.","PeriodicalId":119278,"journal":{"name":"Proceedings of the 2nd ACM International Conference on Multimedia in Asia","volume":"47 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2nd ACM International Conference on Multimedia in Asia","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3444685.3446269","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 3

Abstract

Cross-modal retrieval aims to enable flexible retrieval across different modalities. Its core is to learn projections for each modality so that instances in the learned common subspace become comparable to each other. Self-supervised learning automatically creates a supervision signal by transforming the input data and learns semantic features by training a model to predict the resulting artificial labels. In this paper, we propose a novel method named Self-Supervised Adversarial Learning (SSAL) for cross-modal retrieval, which deploys self-supervised learning and adversarial learning to seek an effective common subspace. A feature projector tries to generate modality-invariant representations in the common subspace that confuse an adversarial discriminator consisting of two classifiers: one aims to predict the rotation angle of an image from its representation, while the other tries to discriminate between modalities from the learned embeddings. By confusing this self-supervised adversarial model, the feature projector filters out the abundant high-level visual semantics and learns image embeddings that are better aligned with the text modality in the common subspace. Through the joint exploitation of these components, an effective common subspace is learned in which representations of different modalities are better aligned and the information shared across modalities is well preserved. Comprehensive experiments on three widely used benchmark datasets show that the proposed method significantly outperforms existing cross-modal retrieval methods.
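
The abstract describes an architecture with modality-specific projectors and a two-headed adversarial discriminator (rotation-angle prediction plus modality discrimination). The sketch below is a minimal illustration of that idea, not the authors' released code; the class names, feature dimensions (4096-d image features, 300-d text features), number of rotation classes, and the simple negated-loss adversarial objective are all assumptions made for illustration.

```python
# Minimal PyTorch sketch of the SSAL idea (assumptions: hypothetical class
# names, feature dimensions, and a simplified adversarial objective).
import torch
import torch.nn as nn

class Projector(nn.Module):
    """Maps modality-specific features into the common subspace."""
    def __init__(self, in_dim, common_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 1024), nn.ReLU(),
            nn.Linear(1024, common_dim),
        )

    def forward(self, x):
        return self.net(x)

class Discriminator(nn.Module):
    """Two-headed adversary: rotation-angle classifier and modality classifier."""
    def __init__(self, common_dim=512, num_rotations=4):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(common_dim, 256), nn.ReLU())
        self.rotation_head = nn.Linear(256, num_rotations)  # e.g. 0/90/180/270 degrees
        self.modality_head = nn.Linear(256, 2)               # image vs. text

    def forward(self, z):
        h = self.backbone(z)
        return self.rotation_head(h), self.modality_head(h)

# Hypothetical training step: the discriminator learns the two auxiliary tasks,
# while the projectors are trained to confuse it (shown here as a negated
# discriminator loss; the paper additionally uses a retrieval objective to
# align matching image-text pairs).
img_proj, txt_proj = Projector(4096), Projector(300)
disc = Discriminator()
ce = nn.CrossEntropyLoss()

img_feat = torch.randn(8, 4096)       # e.g. CNN features of rotated images
txt_feat = torch.randn(8, 300)        # e.g. sentence embeddings
rot_label = torch.randint(0, 4, (8,))                     # rotation applied to each image
mod_label = torch.cat([torch.zeros(8), torch.ones(8)]).long()  # 0 = image, 1 = text

z_img, z_txt = img_proj(img_feat), txt_proj(txt_feat)
z_all = torch.cat([z_img, z_txt], dim=0)

# Discriminator step: predict rotation (images only) and modality (all samples).
rot_logits, mod_logits = disc(z_all.detach())
d_loss = ce(rot_logits[:8], rot_label) + ce(mod_logits, mod_label)

# Projector step: fool the discriminator so the embeddings carry neither
# rotation cues nor modality-specific information.
rot_logits, mod_logits = disc(z_all)
g_loss = -(ce(rot_logits[:8], rot_label) + ce(mod_logits, mod_label))
```

In a full training loop, `d_loss` and `g_loss` would be minimized in alternating steps with separate optimizers, alongside a cross-modal retrieval loss on paired image-text data; those details are omitted here.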