Chaojian Li, Kyungmin Kim, Bichen Wu, Peizhao Zhang, Hang Zhang, Xiaoliang Dai, Péter Vajda, Y. Lin
{"title":"An Investigation on Hardware-Aware Vision Transformer Scaling","authors":"Chaojian Li, Kyungmin Kim, Bichen Wu, Peizhao Zhang, Hang Zhang, Xiaoliang Dai, Péter Vajda, Y. Lin","doi":"10.1145/3611387","DOIUrl":null,"url":null,"abstract":"Vision Transformer (ViT) has demonstrated promising performance in various computer vision tasks, and recently attracted a lot of research attention. Many recent works have focused on proposing new architectures to improve ViT and deploying it into real-world applications. However, little effort has been made to analyze and understand ViT’s architecture design space and its implication of hardware-cost on different devices. In this work, by simply scaling ViT’s depth, width, input size, and other basic configurations, we show that a scaled vanilla ViT model without bells and whistles can achieve comparable or superior accuracy-efficiency trade-off than most of the latest ViT variants. Specifically, compared to DeiT-Tiny, our scaled model achieves a \\(\\uparrow 1.9\\% \\) higher ImageNet top-1 accuracy under the same FLOPs and a \\(\\uparrow 3.7\\% \\) better ImageNet top-1 accuracy under the same latency on an NVIDIA Edge GPU TX2. Motivated by this, we further investigate the extracted scaling strategies from the following two aspects: (1) “can these scaling strategies be transferred across different real hardware devices?”; and (2) “can these scaling strategies be transferred to different ViT variants and tasks?”. For (1), our exploration, based on various devices with different resource budgets, indicates that the transferability effectiveness depends on the underlying device together with its corresponding deployment tool; for (2), we validate the effective transferability of the aforementioned scaling strategies obtained from a vanilla ViT model on top of an image classification task to the PiT model, a strong ViT variant targeting efficiency, as well as object detection and video classification tasks. In particular, when transferred to PiT, our scaling strategies lead to a boosted ImageNet top-1 accuracy of from \\(74.6\\% \\) to \\(76.7\\% \\) ( \\(\\uparrow 2.1\\% \\) ) under the same 0.7G FLOPs; and when transferred to the COCO object detection task, the average precision is boosted by \\(\\uparrow 0.7\\% \\) under a similar throughput on a V100 GPU.","PeriodicalId":50914,"journal":{"name":"ACM Transactions on Embedded Computing Systems","volume":" ","pages":""},"PeriodicalIF":2.8000,"publicationDate":"2023-08-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM Transactions on Embedded Computing Systems","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1145/3611387","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE","Score":null,"Total":0}
引用次数: 0
Abstract
Vision Transformer (ViT) has demonstrated promising performance in various computer vision tasks, and recently attracted a lot of research attention. Many recent works have focused on proposing new architectures to improve ViT and deploying it into real-world applications. However, little effort has been made to analyze and understand ViT’s architecture design space and its implication of hardware-cost on different devices. In this work, by simply scaling ViT’s depth, width, input size, and other basic configurations, we show that a scaled vanilla ViT model without bells and whistles can achieve comparable or superior accuracy-efficiency trade-off than most of the latest ViT variants. Specifically, compared to DeiT-Tiny, our scaled model achieves a \(\uparrow 1.9\% \) higher ImageNet top-1 accuracy under the same FLOPs and a \(\uparrow 3.7\% \) better ImageNet top-1 accuracy under the same latency on an NVIDIA Edge GPU TX2. Motivated by this, we further investigate the extracted scaling strategies from the following two aspects: (1) “can these scaling strategies be transferred across different real hardware devices?”; and (2) “can these scaling strategies be transferred to different ViT variants and tasks?”. For (1), our exploration, based on various devices with different resource budgets, indicates that the transferability effectiveness depends on the underlying device together with its corresponding deployment tool; for (2), we validate the effective transferability of the aforementioned scaling strategies obtained from a vanilla ViT model on top of an image classification task to the PiT model, a strong ViT variant targeting efficiency, as well as object detection and video classification tasks. In particular, when transferred to PiT, our scaling strategies lead to a boosted ImageNet top-1 accuracy of from \(74.6\% \) to \(76.7\% \) ( \(\uparrow 2.1\% \) ) under the same 0.7G FLOPs; and when transferred to the COCO object detection task, the average precision is boosted by \(\uparrow 0.7\% \) under a similar throughput on a V100 GPU.
期刊介绍:
The design of embedded computing systems, both the software and hardware, increasingly relies on sophisticated algorithms, analytical models, and methodologies. ACM Transactions on Embedded Computing Systems (TECS) aims to present the leading work relating to the analysis, design, behavior, and experience with embedded computing systems.