Token Adaptive Vision Transformer with Efficient Deployment for Fine-Grained Image Recognition

Chonghan Lee, Rita Brugarolas Brufau, Ke Ding, N. Vijaykrishnan
{"title":"Token Adaptive Vision Transformer with Efficient Deployment for Fine-Grained Image Recognition","authors":"Chonghan Lee, Rita Brugarolas Brufau, Ke Ding, N. Vijaykrishnan","doi":"10.23919/DATE56975.2023.10137239","DOIUrl":null,"url":null,"abstract":"Fine-grained Visual Classification (FGVC) aims to distinguish object classes belonging to the same category, e.g., different bird species or models of vehicles. The task is more challenging than ordinary image classification due to the subtle inter-class differences. Recent works proposed deep learning models based on the vision transformer (ViT) architecture with its self-attention mechanism to locate important regions of the objects and derive global information. However, deploying them on resource-restricted internet of things (IoT) devices is challenging due to their intensive computational cost and memory footprint. Energy and power consumption varies in different IoT devices. To improve their inference efficiency, previous approaches require manually designing the model architecture and training a separate model for each computational budget. In this work, we propose Token Adaptive Vision Transformer (TAVT) that dynamically drops out tokens and can be used for various inference scenarios across many IoT devices after training the model once. Our adaptive model can switch among different token drop configurations at run time, providing instant accuracy-efficiency trade-offs. We train a vision transformer with a progressive token pruning scheme, eliminating a large number of redundant tokens in the later layers. We then conduct a multi-objective evolutionary search with the overall number of floating point operations (FLOPs) as its efficiency constraint that could be translated to energy consumption and power to find the token pruning schemes that maximize accuracy and efficiency under various computational budgets. Empirical results show that our proposed TAVT dramatically speeds up the GPU inference latency by up to 10× and reduces memory requirements and FLOPs by up to 5.5 × and 13 × respectively while achieving competitive accuracy compared to prior ViT-based state-of-the-art approaches.","PeriodicalId":340349,"journal":{"name":"2023 Design, Automation & Test in Europe Conference & Exhibition (DATE)","volume":"46 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2023 Design, Automation & Test in Europe Conference & Exhibition (DATE)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.23919/DATE56975.2023.10137239","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Fine-grained Visual Classification (FGVC) aims to distinguish object classes belonging to the same category, e.g., different bird species or models of vehicles. The task is more challenging than ordinary image classification due to subtle inter-class differences. Recent works have proposed deep learning models based on the vision transformer (ViT) architecture, whose self-attention mechanism locates important regions of objects and derives global information. However, deploying these models on resource-constrained Internet of Things (IoT) devices is challenging due to their intensive computational cost and memory footprint, and energy and power consumption vary across IoT devices. To improve inference efficiency, previous approaches require manually designing the model architecture and training a separate model for each computational budget. In this work, we propose the Token Adaptive Vision Transformer (TAVT), which dynamically drops tokens and, after being trained once, can serve various inference scenarios across many IoT devices. Our adaptive model can switch among different token drop configurations at run time, providing instant accuracy-efficiency trade-offs. We train a vision transformer with a progressive token pruning scheme that eliminates a large number of redundant tokens in the later layers. We then conduct a multi-objective evolutionary search, using the overall number of floating-point operations (FLOPs) as an efficiency constraint that can be translated into energy and power consumption, to find token pruning schemes that maximize accuracy and efficiency under various computational budgets. Empirical results show that the proposed TAVT reduces GPU inference latency by up to 10× and reduces memory requirements and FLOPs by up to 5.5× and 13×, respectively, while achieving competitive accuracy compared to prior ViT-based state-of-the-art approaches.
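
The abstract describes two mechanisms: attention-guided token pruning applied progressively across layers, and run-time switching among pruning configurations found by an evolutionary search. The sketch below is a minimal illustration of that idea under assumed details, not the authors' implementation: `prune_tokens`, `adaptive_forward`, the block interface returning CLS-to-patch attention, and the example schedules are all hypothetical. In TAVT the actual per-layer configurations would come from the multi-objective evolutionary search under a FLOPs constraint.

```python
import torch

def prune_tokens(x, cls_attn, keep_ratio):
    """Keep only the most-attended patch tokens (hypothetical helper).

    x          : (B, 1 + N, D) token embeddings; index 0 is the CLS token
    cls_attn   : (B, N) attention from the CLS token to each patch token,
                 averaged over heads
    keep_ratio : fraction of the N patch tokens to retain at this layer
    """
    B, n_tokens, D = x.shape
    n_patches = n_tokens - 1
    k = max(1, int(n_patches * keep_ratio))

    # Rank patch tokens by CLS attention and keep the top k.
    topk = cls_attn.topk(k, dim=1).indices               # (B, k)
    idx = topk.unsqueeze(-1).expand(-1, -1, D)           # (B, k, D)

    cls_tok = x[:, :1]                                   # (B, 1, D)
    kept = x[:, 1:].gather(1, idx)                       # (B, k, D)
    return torch.cat([cls_tok, kept], dim=1)

# Hypothetical per-layer keep-ratio schedules, one per compute budget.
# In the paper, such configurations would be produced by the
# multi-objective evolutionary search under a FLOPs constraint.
SCHEDULES = {
    "high_accuracy": [1.0, 1.0, 0.9, 0.9, 0.7, 0.7, 0.5, 0.5],
    "low_power":     [1.0, 0.8, 0.6, 0.5, 0.4, 0.3, 0.2, 0.2],
}

def adaptive_forward(blocks, x, budget="low_power"):
    """Run the transformer once, pruning tokens progressively.

    Each block is assumed to return (updated tokens, CLS-to-patch
    attention); the schedule key can be switched per request at run time.
    """
    for block, ratio in zip(blocks, SCHEDULES[budget]):
        x, cls_attn = block(x)        # assumed block interface
        if ratio < 1.0:
            x = prune_tokens(x, cls_attn, ratio)
    return x
```

Because pruning is progressive, later layers operate on far fewer tokens, which is where the reported FLOPs and memory savings would come from; switching the schedule key trades accuracy for efficiency without retraining.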