Token Pooling in Vision Transformers for Image Classification

D. Marin, Jen-Hao Rick Chang, Anurag Ranjan, Anish K. Prabhu, Mohammad Rastegari, Oncel Tuzel
{"title":"Token Pooling in Vision Transformers for Image Classification","authors":"D. Marin, Jen-Hao Rick Chang, Anurag Ranjan, Anish K. Prabhu, Mohammad Rastegari, Oncel Tuzel","doi":"10.1109/WACV56688.2023.00010","DOIUrl":null,"url":null,"abstract":"Pooling is commonly used to improve the computation-accuracy trade-off of convolutional networks. By aggregating neighboring feature values on the image grid, pooling layers downsample feature maps while maintaining accuracy. In standard vision transformers, however, tokens are processed individually and do not necessarily lie on regular grids. Utilizing pooling methods designed for image grids (e.g., average pooling) thus can be sub-optimal for transformers, as shown by our experiments. In this paper, we propose Token Pooling to downsample token sets in vision transformers. We take a new perspective — instead of assuming tokens form a regular grid, we treat them as discrete (and irregular) samples of an implicit continuous signal. Given a target number of tokens, Token Pooling finds the set of tokens that best approximates the underlying continuous signal. We rigorously evaluate the proposed method on the standard transformer architecture (ViT/DeiT) and on the image classification problem using ImageNet-1k. Our experiments show that Token Pooling significantly improves the computation-accuracy trade-off without any further modifications to the architecture. 
Token Pooling enables DeiT-Ti to achieve the same top-1 accuracy while using 42% fewer computations.","PeriodicalId":270631,"journal":{"name":"2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)","volume":"23 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"6","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/WACV56688.2023.00010","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 6

Abstract

Pooling is commonly used to improve the computation-accuracy trade-off of convolutional networks. By aggregating neighboring feature values on the image grid, pooling layers downsample feature maps while maintaining accuracy. In standard vision transformers, however, tokens are processed individually and do not necessarily lie on regular grids. Utilizing pooling methods designed for image grids (e.g., average pooling) thus can be sub-optimal for transformers, as shown by our experiments. In this paper, we propose Token Pooling to downsample token sets in vision transformers. We take a new perspective — instead of assuming tokens form a regular grid, we treat them as discrete (and irregular) samples of an implicit continuous signal. Given a target number of tokens, Token Pooling finds the set of tokens that best approximates the underlying continuous signal. We rigorously evaluate the proposed method on the standard transformer architecture (ViT/DeiT) and on the image classification problem using ImageNet-1k. Our experiments show that Token Pooling significantly improves the computation-accuracy trade-off without any further modifications to the architecture. Token Pooling enables DeiT-Ti to achieve the same top-1 accuracy while using 42% fewer computations.
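The core idea above — given a target budget, find the subset of tokens that best approximates the implicit continuous signal the tokens sample — can be illustrated with a simple clustering step over token embeddings. The sketch below is only a minimal illustration using plain k-means (the paper's actual objective and algorithm may differ); the function name `token_pooling` and all parameters are hypothetical.

```python
import numpy as np

def token_pooling(tokens: np.ndarray, k: int, n_iters: int = 10, seed: int = 0) -> np.ndarray:
    """Downsample a set of token embeddings to k pooled tokens via k-means.

    Illustrative sketch only: treats the (n, d) token matrix as irregular
    samples of a continuous signal and returns k cluster centers that
    approximate it in the least-squares sense.
    """
    rng = np.random.default_rng(seed)
    n, d = tokens.shape
    # Initialize centers with k distinct tokens chosen at random.
    centers = tokens[rng.choice(n, size=k, replace=False)].copy()
    for _ in range(n_iters):
        # Assign each token to its nearest center (squared Euclidean distance).
        dists = ((tokens[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        assign = dists.argmin(axis=1)
        # Update each center as the mean of its assigned tokens.
        for j in range(k):
            members = tokens[assign == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    return centers
```

In a transformer, such a pooling layer would sit between blocks, reducing the token count (and hence the quadratic attention cost) while the cluster centers preserve the dominant structure of the token set.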