Comprehensive Distance-Preserving Autoencoders for Cross-Modal Retrieval

Proceedings of the 26th ACM International Conference on Multimedia, published 2018-10-15. DOI: 10.1145/3240508.3240607

Yibing Zhan, Jun Yu, Zhou Yu, Rong Zhang, D. Tao, Qi Tian


Citations: 25

Abstract

In this paper, we propose a novel method based on comprehensive distance-preserving autoencoders (CDPAE) to address the problem of unsupervised cross-modal retrieval. Previous unsupervised methods rely primarily on pairwise distances between representations that are extracted from cross-media spaces, co-occur, and belong to the same object. Besides pairwise distances, however, the CDPAE also considers heterogeneous distances between representations extracted from cross-media spaces and homogeneous distances between representations extracted from a single media space, where in both cases the representations belong to different objects. The CDPAE consists of four components. First, denoising autoencoders are used to retain the information in the representations and to reduce the negative influence of redundant noise. Second, a comprehensive distance-preserving common space is proposed to explore the correlations among different representations. It aims to preserve the respective distances between representations within the common space so that they are consistent with the distances in their original media spaces. Third, a novel joint loss function is defined to simultaneously compute the reconstruction loss of the denoising autoencoders and the correlation loss of the comprehensive distance-preserving common space. Finally, an unsupervised cross-modal similarity measurement is proposed to further improve retrieval performance, by calculating the marginal probability of two media objects based on a kNN classifier. The CDPAE is tested on four public datasets with two cross-modal retrieval tasks: "query images by texts" and "query texts by images". In comparisons with eight state-of-the-art cross-modal retrieval methods, the experimental results demonstrate that the CDPAE outperforms all the unsupervised methods and performs competitively with the supervised methods.
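The abstract names the parts of the CDPAE objective but not their exact form. As a rough illustration only, the Python sketch below assembles a joint loss from a masking-noise corruption step (the "denoising" part), a mean-squared reconstruction error, and a distance-preserving correlation term with pairwise, homogeneous, and heterogeneous parts. The squared-error forms, the heterogeneous target (the mean of the two original within-modality distances), the weight `lam`, and every function name here are assumptions made for this sketch, not the paper's equations; features are also assumed to be scaled so that distances in the two media spaces are comparable.

```python
import math
import random
from itertools import combinations
from typing import List, Sequence

Vector = Sequence[float]


def mask_noise(x: Vector, drop_rate: float, rng: random.Random) -> List[float]:
    """Denoising-autoencoder style corruption: zero each entry with probability drop_rate."""
    return [0.0 if rng.random() < drop_rate else float(v) for v in x]


def mse(x: Vector, x_hat: Vector) -> float:
    """Mean squared error between a clean input and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


def distance_preserving_loss(img_feats: Sequence[Vector], txt_feats: Sequence[Vector],
                             img_codes: Sequence[Vector], txt_codes: Sequence[Vector]) -> float:
    """Hypothetical correlation loss over n image/text pairs (index i = object i).

    *_feats are original media-space features; *_codes are common-space codes.
    """
    n = len(img_codes)
    # Pairwise term: the co-occurring image and text codes of one object should coincide.
    pairwise = sum(math.dist(img_codes[i], txt_codes[i]) ** 2 for i in range(n)) / n

    homo = hetero = 0.0
    pairs = list(combinations(range(n), 2))
    for i, j in pairs:
        d_img = math.dist(img_feats[i], img_feats[j])  # original image-space distance
        d_txt = math.dist(txt_feats[i], txt_feats[j])  # original text-space distance
        # Homogeneous term: same-modality distances between different objects are kept.
        homo += (math.dist(img_codes[i], img_codes[j]) - d_img) ** 2
        homo += (math.dist(txt_codes[i], txt_codes[j]) - d_txt) ** 2
        # Heterogeneous term: cross-modal distances between different objects follow
        # the mean of the two original distances (an assumption of this sketch).
        target = 0.5 * (d_img + d_txt)
        hetero += (math.dist(img_codes[i], txt_codes[j]) - target) ** 2
        hetero += (math.dist(img_codes[j], txt_codes[i]) - target) ** 2
    if pairs:
        homo /= len(pairs)
        hetero /= len(pairs)
    return pairwise + homo + hetero


def joint_loss(imgs, img_recons, txts, txt_recons, img_codes, txt_codes,
               lam: float = 1.0) -> float:
    """Average reconstruction loss of both autoencoders plus lam * correlation loss.

    img_recons / txt_recons are assumed to be decoder outputs computed from
    mask_noise-corrupted inputs, so mse compares them with the clean inputs.
    """
    recon = sum(mse(x, xh) for x, xh in zip(imgs, img_recons))
    recon += sum(mse(y, yh) for y, yh in zip(txts, txt_recons))
    corr = distance_preserving_loss(imgs, txts, img_codes, txt_codes)
    return recon / len(imgs) + lam * corr
```

The encoders and decoders are deliberately left out: in a real model the codes and reconstructions would come from networks applied to corrupted inputs, and the loss would be minimized with automatic differentiation rather than evaluated in plain Python.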

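The final component scores a query against a candidate through the marginal probability of the two objects under a kNN classifier. One possible reading of that idea is sketched below, as an assumption rather than the paper's procedure: each common-space embedding gets a label posterior from its k nearest anchor points, and the similarity is the sum over labels l of P(l | q) * P(l | c), i.e. the probability that query and candidate receive the same label. Because retrieval here is unsupervised, the anchor labels are hypothetical pseudo-labels (for instance, one label per training image-text pair, or cluster assignments); `knn_posterior`, `marginal_similarity`, `rank`, and `k` are illustrative names.

```python
import math
from collections import Counter
from typing import Dict, Hashable, List, Sequence

Vector = Sequence[float]


def knn_posterior(x: Vector, anchors: Sequence[Vector], labels: Sequence[Hashable],
                  k: int) -> Dict[Hashable, float]:
    """Estimate P(label | x) as label frequencies among the k nearest anchors of x."""
    nearest = sorted(range(len(anchors)), key=lambda i: math.dist(x, anchors[i]))[:k]
    if not nearest:
        return {}
    counts = Counter(labels[i] for i in nearest)
    return {lab: c / len(nearest) for lab, c in counts.items()}


def marginal_similarity(q: Vector, c: Vector, anchors: Sequence[Vector],
                        labels: Sequence[Hashable], k: int = 5) -> float:
    """Probability that q and c get the same label: sum over l of P(l|q) * P(l|c)."""
    pq = knn_posterior(q, anchors, labels, k)
    pc = knn_posterior(c, anchors, labels, k)
    return sum(p * pc.get(lab, 0.0) for lab, p in pq.items())


def rank(query: Vector, candidates: Sequence[Vector], anchors: Sequence[Vector],
         labels: Sequence[Hashable], k: int = 5) -> List[int]:
    """Candidate indices sorted by decreasing marginal similarity to the query."""
    scores = [marginal_similarity(query, cand, anchors, labels, k) for cand in candidates]
    return sorted(range(len(candidates)), key=lambda i: -scores[i])


if __name__ == "__main__":
    # Toy common-space anchors with hypothetical pseudo-labels.
    anchors = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
    labels = ["a", "a", "b", "b"]
    print(rank([0.0, 0.5], [[10.0, 10.5], [0.1, 0.0]], anchors, labels, k=2))  # [1, 0]
```

Here the query and the second candidate share the same neighborhood of "a" anchors, so that candidate is ranked first even though both scores come from label agreement rather than raw distance.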
