Alignment efficient image-sentence retrieval considering transferable cross-modal representation learning

IF 4.6 Zone 3 (Computer Science) Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS

Frontiers of Computer Science Pub Date : 2023-12-02 DOI:10.1007/s11704-023-3186-6

Yang Yang, Jinyi Guo, Guangyu Li, Lanyu Li, Wenjie Li, Jian Yang


Citations: 0

Abstract

Traditional image-sentence cross-modal retrieval methods usually aim to learn consistent representations of heterogeneous modalities, so that similar instances in one modality can be retrieved with a query from the other. The basic assumption behind these methods is that parallel multi-modal data (i.e., data in which the different modalities of the same example are aligned) is available in advance. In other words, image-sentence cross-modal retrieval is treated as a supervised task with the alignments as ground truth. However, in many real-world applications, it is difficult to align a large amount of data for new scenarios because of the substantial labor costs, which leaves the multi-modal data non-parallel and prevents existing methods from being applied directly. On the other hand, auxiliary parallel multi-modal data with similar semantics often exists and can help the non-parallel data learn consistent representations. Therefore, in this paper, we study “Alignment Efficient Image-Sentence Retrieval” (AEIR), which uses auxiliary parallel image-sentence data as the source domain and the non-parallel data as the target domain. Unlike single-modal transfer learning, AEIR learns consistent image-sentence cross-modal representations for the target domain by transferring the alignments of existing parallel data. Specifically, AEIR learns consistent image-sentence representations in the source domain from parallel data, while transferring the alignment knowledge across domains by jointly optimizing a newly designed cross-domain cross-modal metric-learning constraint with an intra-modal domain adversarial loss. Consequently, consistent representations for the target domain can be learned effectively, accounting for both structure and semantic transfer. Furthermore, extensive experiments on different transfer scenarios show that AEIR achieves better retrieval results than the baselines.
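
The abstract gives no formulas, so the following is only a minimal sketch of the two generic ingredients it names: an alignment loss on parallel source pairs and a per-modality domain adversarial term. It assumes a standard bidirectional hinge ranking loss and a binary cross-entropy domain discriminator. These are common choices in the retrieval and domain-adaptation literature, not necessarily the losses used in AEIR, and the paper's cross-domain cross-modal metric-learning constraint is not reproduced here. All function names, the margin, and the weight `lam` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def ranking_loss(img_emb, txt_emb, margin=0.2):
    """Bidirectional hinge ranking loss over aligned (image_i, sentence_i) pairs.

    Assumption: a generic stand-in for the source-domain alignment objective.
    """
    n = len(img_emb)
    sim = [[cosine(img_emb[i], txt_emb[j]) for j in range(n)] for i in range(n)]
    loss = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # Image i as query: a wrong sentence j should score below sentence i.
            loss += max(0.0, margin - sim[i][i] + sim[i][j])
            # Sentence i as query: a wrong image j should score below image i.
            loss += max(0.0, margin - sim[i][i] + sim[j][i])
    return loss / n

def domain_bce(src_scores, tgt_scores):
    """Binary cross-entropy of a domain discriminator within one modality.

    Scores are the discriminator's probabilities of "source domain";
    source samples are labelled 1 and target samples 0.
    """
    eps = 1e-12
    loss = -sum(math.log(p + eps) for p in src_scores)
    loss -= sum(math.log(1.0 - p + eps) for p in tgt_scores)
    return loss / (len(src_scores) + len(tgt_scores))

def joint_objective(src_img, src_txt, d_img, d_txt, lam=0.1):
    """Hypothetical encoder-side objective.

    Minimizes the alignment loss on source pairs while maximizing the
    discriminator loss in each modality, so that source and target
    embeddings become hard to tell apart.
    d_img and d_txt are (src_scores, tgt_scores) tuples.
    """
    align = ranking_loss(src_img, src_txt)
    adv = domain_bce(*d_img) + domain_bce(*d_txt)
    return align - lam * adv
```

In this sketch the alignment term vanishes when every matched pair beats every mismatched pair by at least the margin, and the adversarial term is largest for the encoder when the discriminators cannot separate the domains. In practice the discriminators would be trained in alternation with the encoders, which the sketch leaves out.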


Source journal

Frontiers of Computer Science COMPUTER SCIENCE, INFORMATION SYSTEMS-COMPUTER SCIENCE, SOFTWARE ENGINEERING

CiteScore

8.60

Self-citation rate

2.40%

Articles published

799

Review time

6-12 weeks

Journal description: Frontiers of Computer Science aims to provide a forum for the publication of peer-reviewed papers to promote rapid communication and exchange between computer scientists. The journal publishes research papers and review articles on a wide range of topics, including architecture, software, artificial intelligence, theoretical computer science, networks and communication, information systems, multimedia and graphics, information security, and interdisciplinary research. The journal especially encourages papers from newly emerging and multidisciplinary areas, as well as papers reflecting international trends in research and development and papers on special topics reporting progress made by Chinese computer scientists.