AutoViT: Achieving Real-Time Vision Transformers on Mobile via Latency-aware Coarse-to-Fine Search

Impact Factor: 9.3; Zone 2 (Computer Science); JCR Q1 (Computer Science, Artificial Intelligence)

International Journal of Computer Vision; Pub Date: 2025-05-26; DOI: 10.1007/s11263-025-02480-w

Zhenglun Kong, Dongkuan Xu, Zhengang Li, Peiyan Dong, Hao Tang, Yanzhi Wang, Subhabrata Mukherjee



Abstract

Despite their impressive performance on various tasks, vision transformers (ViTs) are too heavy for mobile vision applications. Recent works have proposed combining the strengths of ViTs and convolutional neural networks (CNNs) to build lightweight networks. Still, these approaches rely on hand-designed architectures with a pre-determined number of parameters. In this work, we use neural architecture search to address the challenge of finding optimal lightweight ViTs under constraints on model size and computational cost. Our search algorithm considers both the number of model parameters and on-device deployment latency. The method analyzes network properties, hardware memory access patterns, and the degree of parallelism to estimate network latency directly and accurately. To avoid extensive testing during the search, we build a lookup table from a detailed breakdown of the speed of each component and operation; the table can be reused to evaluate the overall latency of each searched structure. This is more efficient than measuring the speed of the whole model during the search. Extensive experiments demonstrate that, under similar parameter counts and FLOPs, our searched lightweight ViTs achieve higher accuracy and lower latency than state-of-the-art models. For instance, on ImageNet-1K, AutoViT_XXS (71.3% Top-1 accuracy, 10.2 ms latency) outperforms MobileViTv3_XXS (71.0% Top-1 accuracy, 12.5 ms latency) by 0.3 percentage points in accuracy and 2.3 ms in latency.
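The lookup-table idea in the abstract can be illustrated with a minimal sketch. The abstract does not specify the table's layout, so everything below is an assumption made for illustration: the operator key `(type, resolution, channels)`, the names `LATENCY_LUT`, `estimate_latency`, and `filter_by_budget`, and the purely additive latency model. The latency values are invented placeholders, not measurements from the paper, and the paper's estimator also accounts for memory access patterns and parallelism, which this sketch ignores.

```python
from typing import Dict, List, Tuple

# Hypothetical operator key: (operator type, input resolution, channel width).
OpKey = Tuple[str, int, int]

# Placeholder per-operator latencies in milliseconds. These are invented
# numbers for illustration only, not measurements from the paper.
LATENCY_LUT: Dict[OpKey, float] = {
    ("conv3x3", 64, 32): 0.40,
    ("conv3x3", 32, 64): 0.35,
    ("attention", 32, 64): 1.20,
    ("attention", 16, 96): 0.90,
    ("mlp", 32, 64): 0.50,
    ("mlp", 16, 96): 0.45,
}


def estimate_latency(arch: List[OpKey], lut: Dict[OpKey, float]) -> float:
    """Estimate latency by summing table entries (assumes additivity)."""
    total = 0.0
    for op in arch:
        if op not in lut:
            # An unprofiled operator cannot be estimated from the table.
            raise KeyError(f"operator {op} has not been profiled")
        total += lut[op]
    return total


def filter_by_budget(
    candidates: Dict[str, List[OpKey]],
    lut: Dict[OpKey, float],
    budget_ms: float,
) -> List[str]:
    """Return names of candidates whose estimated latency fits the budget."""
    return [
        name
        for name, arch in candidates.items()
        if estimate_latency(arch, lut) <= budget_ms
    ]


# Two hypothetical candidate architectures, written as operator sequences.
candidates: Dict[str, List[OpKey]] = {
    "small": [("conv3x3", 64, 32), ("attention", 32, 64), ("mlp", 32, 64)],
    "large": [
        ("conv3x3", 64, 32),
        ("conv3x3", 32, 64),
        ("attention", 32, 64),
        ("mlp", 32, 64),
        ("attention", 16, 96),
        ("mlp", 16, 96),
    ],
}
```

Calling `filter_by_budget(candidates, LATENCY_LUT, 3.0)` keeps only the candidates whose summed table entries stay within 3 ms. Each operator is profiled once, and every candidate is then scored by table lookups rather than by deploying and timing the full model, which is the efficiency gain the abstract describes.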


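Source journal: International Journal of Computer Vision (Engineering and Technology; Computer Science: Artificial Intelligence)

CiteScore: 29.80

Self-citation rate: 2.10%

Articles published: 163

Review time: 6 months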

About the journal: The International Journal of Computer Vision (IJCV) publishes high-quality, original contributions to the science and engineering of computer vision in 12 issues per year. Regular articles, up to 25 journal pages, present significant technical advances of broad interest to the field. Short articles, limited to 10 pages, offer a faster publication path for novel research results. Survey articles, up to 30 pages, offer critical evaluations of the state of the art or tutorial presentations of relevant topics. The journal also publishes book reviews, position papers, and editorials by prominent scientific figures. Authors are encouraged to provide online supplementary material, such as images, video sequences, data sets, and software, to aid understanding and reproducibility of the published research.