Transductive Visual-Semantic Embedding for Zero-shot Learning

Xing Xu, Fumin Shen, Yang Yang, Jie Shao, Zi Huang

Proceedings of the 2017 ACM on International Conference on Multimedia Retrieval (ICMR '17), June 6, 2017. DOI: 10.1145/3078971.3078977
Zero-shot learning (ZSL) aims to transfer knowledge, via available semantic representations (e.g., attributes), from labeled source instances of seen classes to unlabeled target instances of unseen classes. Most existing ZSL approaches achieve this by learning a projection from the visual feature space to the semantic representation space on the source instances and directly applying it to the target instances. However, the intrinsic manifold structures residing in both the semantic representations and the visual features are not effectively incorporated into the learned projection function. Moreover, these methods may suffer from the inherent projection shift problem caused by the disjointness between the seen and unseen classes. To overcome these drawbacks, we propose a novel framework termed transductive visual-semantic embedding (TVSE) for ZSL. Specifically, TVSE first learns a latent embedding space that incorporates the manifold structures of both the labeled source instances and the unlabeled target instances under the transductive setting. In the learned space, each instance is viewed as a mixture of seen class scores. TVSE then constructs a relational mapping between the seen and unseen classes using the available semantic representations, and applies it to map the seen class scores of each target instance to predictions over the unseen classes. Extensive experiments on four benchmark datasets demonstrate that TVSE achieves competitive performance compared with state-of-the-art methods on zero-shot recognition and retrieval tasks.
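The abstract's second stage, mapping seen-class scores of target instances to unseen-class predictions through a relational mapping derived from semantic representations, can be illustrated with a minimal NumPy sketch. The cosine-similarity construction of the mapping below is an illustrative assumption rather than the paper's exact formulation, and all function and variable names (`unseen_predictions`, `sem_seen`, `sem_unseen`, etc.) are hypothetical.

```python
import numpy as np

def unseen_predictions(seen_scores, sem_seen, sem_unseen):
    """Map seen-class scores of target instances to unseen-class predictions.

    seen_scores : (n, S) array -- per-instance scores over the S seen classes
    sem_seen    : (S, d) array -- semantic representations (e.g., attribute
                                  vectors) of the seen classes
    sem_unseen  : (U, d) array -- semantic representations of the unseen classes
    """
    # Relational mapping between unseen and seen classes, built here as
    # cosine similarity of their semantic representations (an assumption).
    a = sem_unseen / np.linalg.norm(sem_unseen, axis=1, keepdims=True)
    b = sem_seen / np.linalg.norm(sem_seen, axis=1, keepdims=True)
    rel = a @ b.T                        # (U, S) class-relation matrix

    # Each target instance's unseen-class scores are a relation-weighted
    # combination of its seen-class scores ("mixture of seen class scores").
    unseen_scores = seen_scores @ rel.T  # (n, U)
    return unseen_scores.argmax(axis=1)  # predicted unseen-class index per instance

# Toy usage: 2 target instances, 3 seen classes, 2 unseen classes, 5-dim attributes.
scores = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.3, 0.6]])
rng = np.random.default_rng(0)
A_seen, A_unseen = rng.random((3, 5)), rng.random((2, 5))
print(unseen_predictions(scores, A_seen, A_unseen))
```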