TransZero++: Cross Attribute-Guided Transformer for Zero-Shot Learning

Impact factor: 18.6

IEEE Transactions on Pattern Analysis and Machine Intelligence; Pub Date: 2022-12-15; DOI: 10.1109/TPAMI.2022.3229526

Shiming Chen; Ziming Hong; Wenjin Hou; Guo-Sen Xie; Yibing Song; Jian Zhao; Xinge You; Shuicheng Yan; Ling Shao

Volume 45, Issue 11, pp. 12844-12861. IEEE Xplore: https://ieeexplore.ieee.org/document/9987664/

Citations: 12

Abstract

Zero-shot learning (ZSL) tackles the novel class recognition problem by transferring semantic knowledge from seen classes to unseen ones. Semantic knowledge is typically represented by attribute descriptions shared between different classes, which act as strong priors for localizing object attributes that represent discriminative region features, enabling significant and sufficient visual-semantic interaction for advancing ZSL. Existing attention-based models learn only inferior region features in a single image because they rely solely on unidirectional attention, which ignores the transferable and discriminative attribute localization of visual features needed to represent the key semantic knowledge for effective knowledge transfer in ZSL. In this paper, we propose a cross attribute-guided Transformer network, termed TransZero++, to refine visual features and learn accurate attribute localization for key semantic knowledge representations in ZSL. Specifically, TransZero++ employs an attribute$\rightarrow$visual Transformer sub-net (AVT) and a visual$\rightarrow$attribute Transformer sub-net (VAT) to learn attribute-based visual features and visual-based attribute features, respectively. By further introducing feature-level and prediction-level semantical collaborative losses, the two attribute-guided Transformers teach each other to learn semantic-augmented visual embeddings for key semantic knowledge representations via semantical collaborative learning. Finally, the semantic-augmented visual embeddings learned by AVT and VAT are fused and combined with class semantic vectors to perform visual-semantic interaction for ZSL classification. Extensive experiments show that TransZero++ achieves new state-of-the-art results on three golden ZSL benchmarks and on the large-scale ImageNet dataset. The project website is available at: https://shiming-chen.github.io/TransZero-pp/TransZero-pp.html
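The bidirectional idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation. It assumes plain scaled dot-product attention with no learned projections, hand-picked toy vectors, a mean-squared feature-level loss, a Jensen-Shannon prediction-level loss, and simple averaging as the fusion step. All names (`cross_attention`, `avt_embed`, `vat_embed`, and so on) are hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cross_attention(queries, memory):
    """Scaled dot-product attention: each query attends over `memory`."""
    scale = math.sqrt(len(memory[0]))
    outputs = []
    for q in queries:
        w = softmax([dot(q, m) / scale for m in memory])
        outputs.append([sum(wj * m[d] for wj, m in zip(w, memory))
                        for d in range(len(memory[0]))])
    return outputs

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Symmetric Jensen-Shannon divergence between two distributions."""
    mid = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, mid) + 0.5 * kl(q, mid)

# Toy inputs (all values hypothetical): 3 attributes, 4 image regions, dim 2.
attributes = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
regions = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5], [0.0, 0.3]]
class_vectors = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]]  # per-class attribute strengths

# attribute -> visual direction: one attended visual feature per attribute.
avt_feats = cross_attention(attributes, regions)
# visual -> attribute direction: one attended attribute feature per region.
vat_feats = cross_attention(regions, attributes)

# Map each branch to an attribute-space embedding (an assumed simplification).
avt_embed = [dot(f, a) for f, a in zip(avt_feats, attributes)]
vat_embed = [sum(dot(f, a) for f in vat_feats) / len(vat_feats) for a in attributes]

# Class probabilities from each branch via similarity to class semantic vectors.
avt_probs = softmax([dot(avt_embed, c) for c in class_vectors])
vat_probs = softmax([dot(vat_embed, c) for c in class_vectors])

# Collaborative losses (assumed forms): the branches are pulled toward agreement.
feature_loss = sum((x - y) ** 2 for x, y in zip(avt_embed, vat_embed)) / len(avt_embed)
prediction_loss = js_divergence(avt_probs, vat_probs)

# Fusion (assumed form): average the two branches' class probabilities.
fused_probs = [(p + q) / 2 for p, q in zip(avt_probs, vat_probs)]
print(feature_loss, prediction_loss, fused_probs)
```

In a trained model the two losses would be minimized jointly with a classification objective, so that each branch regularizes the other. Here they are only evaluated once on fixed toy inputs.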

