Two-Stream Transformer for Multi-Label Image Classification

Xueling Zhu, Jiuxin Cao, Jiawei Ge, Weijia Liu, Bo Liu
{"title":"Two-Stream Transformer for Multi-Label Image Classification","authors":"Xueling Zhu, Jiuxin Cao, Jiawei Ge, Weijia Liu, Bo Liu","doi":"10.1145/3503161.3548343","DOIUrl":null,"url":null,"abstract":"Multi-label image classification is a fundamental yet challenging task in computer vision that aims to identify multiple objects from a given image. Recent studies on this task mainly focus on learning cross-modal interactions between label semantics and high-level visual representations via an attention operation. However, these one-shot attention based approaches generally perform poorly in establishing accurate and robust alignments between vision and text due to the acknowledged semantic gap. In this paper, we propose a two-stream transformer (TSFormer) learning framework, in which the spatial stream focuses on extracting patch features with a global perception, while the semantic stream aims to learn vision-aware label semantics as well as their correlations via a multi-shot attention mechanism. Specifically, in each layer of TSFormer, a cross-modal attention module is developed to aggregate visual features from spatial stream into semantic stream and update label semantics via a residual connection. In this way, the semantic gap between two streams gradually narrows as the procedure progresses layer by layer, allowing the semantic stream to produce sophisticated visual representations for each label towards accurate label recognition. Extensive experiments on three visual benchmarks, including Pascal VOC 2007, Microsoft COCO and NUS-WIDE, consistently demonstrate that our proposed TSFormer achieves state-of-the-art performance on the multi-label image classification task.","PeriodicalId":412792,"journal":{"name":"Proceedings of the 30th ACM International Conference on Multimedia","volume":"5 3","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"8","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 30th ACM International Conference on Multimedia","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3503161.3548343","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 8

Abstract

Multi-label image classification is a fundamental yet challenging task in computer vision that aims to identify multiple objects in a given image. Recent studies on this task mainly focus on learning cross-modal interactions between label semantics and high-level visual representations via an attention operation. However, these one-shot attention-based approaches generally perform poorly at establishing accurate and robust alignments between vision and text due to the well-known semantic gap. In this paper, we propose a two-stream transformer (TSFormer) learning framework, in which the spatial stream focuses on extracting patch features with global perception, while the semantic stream learns vision-aware label semantics, as well as their correlations, via a multi-shot attention mechanism. Specifically, in each layer of TSFormer, a cross-modal attention module aggregates visual features from the spatial stream into the semantic stream and updates the label semantics via a residual connection. In this way, the semantic gap between the two streams gradually narrows as the procedure progresses layer by layer, allowing the semantic stream to produce sophisticated visual representations for each label and thereby enabling accurate label recognition. Extensive experiments on three visual benchmarks, Pascal VOC 2007, Microsoft COCO, and NUS-WIDE, consistently demonstrate that our proposed TSFormer achieves state-of-the-art performance on the multi-label image classification task.
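The mechanism the abstract describes lends itself to a compact illustration: label tokens attend to patch tokens in every layer and are updated through a residual connection, so stacking layers gives the "multi-shot" refinement. Below is a minimal PyTorch sketch of that idea. All names (`CrossModalLayer`, `TSFormerSketch`, `num_labels`, `depth`) and design details (feed-forward sublayers omitted, backbone patch features assumed precomputed, standard `nn.MultiheadAttention` used throughout) are assumptions inferred from the abstract, not the authors' actual implementation.

```python
# Minimal sketch of the cross-modal attention mechanism described in the
# abstract. Module names, layer counts, and the use of nn.MultiheadAttention
# are assumptions, not the paper's implementation.
import torch
import torch.nn as nn


class CrossModalLayer(nn.Module):
    """One hypothetical TSFormer-style layer: self-attention inside each
    stream, then cross-modal attention from label tokens to patch tokens."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        # Spatial stream: patch tokens attend to each other (global perception).
        self.spatial_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Semantic stream: label tokens attend to each other (label correlations).
        self.label_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Cross-modal attention: labels are queries, patch features are keys/values.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_p = nn.LayerNorm(dim)
        self.norm_l = nn.LayerNorm(dim)
        self.norm_c = nn.LayerNorm(dim)

    def forward(self, patches: torch.Tensor, labels: torch.Tensor):
        # patches: (B, N_patches, dim); labels: (B, N_labels, dim)
        p_out, _ = self.spatial_attn(patches, patches, patches)
        patches = self.norm_p(patches + p_out)

        l_out, _ = self.label_attn(labels, labels, labels)
        labels = self.norm_l(labels + l_out)

        # Aggregate visual features into the semantic stream and update the
        # label semantics via a residual connection, as the abstract describes.
        c_out, _ = self.cross_attn(labels, patches, patches)
        labels = self.norm_c(labels + c_out)
        return patches, labels


class TSFormerSketch(nn.Module):
    """Stacking the layers yields the 'multi-shot' attention: each layer is
    one more shot at aligning label semantics with visual evidence."""

    def __init__(self, dim: int = 256, num_labels: int = 80, depth: int = 4):
        super().__init__()
        self.label_embed = nn.Parameter(torch.randn(num_labels, dim) * 0.02)
        self.layers = nn.ModuleList(CrossModalLayer(dim) for _ in range(depth))
        self.classifier = nn.Linear(dim, 1)  # one binary logit per label

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        # patches: (B, N, dim) features from any backbone (e.g. ViT patches).
        labels = self.label_embed.unsqueeze(0).expand(patches.size(0), -1, -1)
        for layer in self.layers:
            patches, labels = layer(patches, labels)
        return self.classifier(labels).squeeze(-1)  # (B, num_labels)


if __name__ == "__main__":
    model = TSFormerSketch(dim=256, num_labels=80, depth=4)
    feats = torch.randn(2, 196, 256)  # e.g. a 14x14 patch grid
    print(model(feats).shape)  # torch.Size([2, 80])
```

Running the script prints `torch.Size([2, 80])`, one logit per label; training such a model would pair these logits with a binary cross-entropy loss, the standard objective for multi-label classification.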