Transductive Visual-Semantic Embedding for Zero-shot Learning

Xing Xu, Fumin Shen, Yang Yang, Jie Shao, Zi Huang
{"title":"Transductive Visual-Semantic Embedding for Zero-shot Learning","authors":"Xing Xu, Fumin Shen, Yang Yang, Jie Shao, Zi Huang","doi":"10.1145/3078971.3078977","DOIUrl":null,"url":null,"abstract":"Zero-shot learning (ZSL) aims to bridge the knowledge transfer via available semantic representations (e.g., attributes) between labeled source instances of seen classes and unlabelled target instances of unseen classes. Most existing ZSL approaches achieve this by learning a projection from the visual feature space to the semantic representation space based on the source instances, and directly applying it to the target instances. However, the intrinsic manifold structures residing in both semantic representations and visual features are not effectively incorporated into the learned projection function. Moreover, these methods may suffer from the inherent projection shift problem, due to the disjointness between seen and unseen classes. To overcome these drawbacks, we propose a novel framework termed transductive visual-semantic embedding (TVSE) for ZSL. In specific, TVSE first learns a latent embedding space to incorporate the manifold structures in both labeled source instances and unlabeled target instances under the transductive setting. In the learned space, each instance is viewed as a mixture of seen class scores. TVSE then effectively constructs the relational mapping between seen and unseen classes using the available semantic representations, and applies it to map the seen class scores of the target instances to their predictions of unseen classes. Extensive experiments on four benchmark datasets demonstrate that the proposed TVSE achieves competitive performance compared with the state-of-the-arts for zero-shot recognition and retrieval tasks.","PeriodicalId":403556,"journal":{"name":"Proceedings of the 2017 ACM on International Conference on Multimedia Retrieval","volume":"41 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"16","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2017 ACM on International Conference on Multimedia Retrieval","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3078971.3078977","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 16

Abstract

Zero-shot learning (ZSL) aims to transfer knowledge, via available semantic representations (e.g., attributes), from labeled source instances of seen classes to unlabeled target instances of unseen classes. Most existing ZSL approaches achieve this by learning a projection from the visual feature space to the semantic representation space on the source instances and applying it directly to the target instances. However, the intrinsic manifold structures residing in both the semantic representations and the visual features are not effectively incorporated into the learned projection function. Moreover, these methods may suffer from an inherent projection shift problem, since the seen and unseen classes are disjoint. To overcome these drawbacks, we propose a novel framework termed transductive visual-semantic embedding (TVSE) for ZSL. Specifically, TVSE first learns a latent embedding space that incorporates the manifold structures of both the labeled source instances and the unlabeled target instances under the transductive setting. In the learned space, each instance is represented as a mixture of seen-class scores. TVSE then constructs a relational mapping between seen and unseen classes from the available semantic representations, and applies it to map the seen-class scores of the target instances to predictions over the unseen classes. Extensive experiments on four benchmark datasets demonstrate that TVSE achieves competitive performance against state-of-the-art methods on zero-shot recognition and retrieval tasks.
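To make the second stage of this pipeline concrete, below is a minimal NumPy sketch of mapping target instances' seen-class score mixtures to unseen-class predictions through a relational mapping built from class attribute vectors. All names (unseen_predictions, attr_seen, attr_unseen) and the choice of cosine similarity for the seen-unseen relation are illustrative assumptions, not the paper's exact formulation.

```python
# Hedged sketch: seen-class scores -> unseen-class predictions via
# attribute-space similarity. Illustrative only; TVSE's actual relational
# mapping may be constructed differently.
import numpy as np

def unseen_predictions(seen_scores, attr_seen, attr_unseen):
    """seen_scores: (n_targets, n_seen)   mixture of seen-class scores per instance
       attr_seen:   (n_seen, d_attr)      semantic vectors of seen classes
       attr_unseen: (n_unseen, d_attr)    semantic vectors of unseen classes
       returns:     (n_targets, n_unseen) scores over unseen classes
    """
    # Cosine similarity between every seen and unseen class in attribute space.
    s = attr_seen / np.linalg.norm(attr_seen, axis=1, keepdims=True)
    u = attr_unseen / np.linalg.norm(attr_unseen, axis=1, keepdims=True)
    rel = s @ u.T  # (n_seen, n_unseen) relational mapping between class sets
    # Push each target's seen-class mixture through the seen->unseen relation.
    return seen_scores @ rel

# Toy usage: 3 target instances, 5 seen classes, 2 unseen classes, 4 attributes.
rng = np.random.default_rng(0)
scores = rng.random((3, 5))
preds = unseen_predictions(scores, rng.random((5, 4)), rng.random((2, 4)))
print(preds.argmax(axis=1))  # predicted unseen class index per target instance
```

The point of this construction is that no labeled data from the unseen classes is ever touched: the only bridge is the attribute similarity between the two class sets, which is what lets the seen-class score mixture act as evidence for unseen classes.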