EMOv2: Pushing 5M Vision Model Frontier
Jiangning Zhang; Teng Hu; Haoyang He; Zhucun Xue; Yabiao Wang; Chengjie Wang; Yong Liu; Xiangtai Li; Dacheng Tao
IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 47, no. 11, pp. 10560-10576, 2025. DOI: 10.1109/TPAMI.2025.3596776
Abstract
This work focuses on developing parameter-efficient and lightweight models for dense prediction while trading off parameters, FLOPs, and performance. Our goal is to set up the new frontier of 5M-magnitude lightweight models on various downstream tasks. The Inverted Residual Block (IRB) serves as the infrastructure for lightweight CNNs, but no counterpart has been recognized in attention-based designs. Our work rethinks the lightweight infrastructure of the efficient IRB and the practical components of the Transformer from a unified perspective, extending the CNN-based IRB to attention-based models and abstracting a one-residual Meta Mobile Block (MMBlock) for lightweight model design. Following neat but effective design criteria, we deduce a modern Improved Inverted Residual Mobile Block (i$^{2}$RMB) and improve a hierarchical Efficient MOdel (EMOv2) with no elaborately complex structures. Considering the imperceptible latency for mobile users when downloading models under 4G/5G bandwidth, and ensuring model performance, we investigate the performance upper limit of lightweight models at the 5M magnitude. Extensive experiments on various vision recognition, dense prediction, and image generation tasks demonstrate the superiority of our EMOv2 over state-of-the-art methods; e.g., EMOv2-1M/2M/5M achieve 72.3, 75.8, and 79.4 Top-1 accuracy, significantly surpassing equal-order CNN- and attention-based models. Meanwhile, a RetinaNet equipped with EMOv2-5M achieves 41.5 mAP on object detection, surpassing the previous EMO-5M by +2.6. When employing a more robust training recipe, our EMOv2-5M eventually achieves 82.9 Top-1 accuracy, which elevates the performance of 5M-magnitude models to a new level.
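To make the one-residual Meta Mobile Block abstraction concrete, below is a minimal PyTorch sketch of a block that expands channels, applies a pluggable spatial mixer (a depthwise convolution for an IRB-like instantiation, or self-attention for an attention-based one), and projects back with a single residual connection. The class name, hyper-parameters, and layer choices here are illustrative assumptions for exposition only; they are not the paper's exact i$^{2}$RMB design, whose details (e.g., its specific attention formulation) are given in the full text.

```python
import torch
import torch.nn as nn


class MetaMobileBlock(nn.Module):
    """Illustrative sketch of a one-residual Meta Mobile Block (assumed design).

    Expand -> spatial mixer (depthwise conv or self-attention) -> project,
    wrapped by a single residual connection.
    """

    def __init__(self, dim: int, expansion: int = 4, mixer: str = "conv", num_heads: int = 4):
        super().__init__()
        hidden = dim * expansion
        self.mixer = mixer
        self.norm = nn.BatchNorm2d(dim)
        self.expand = nn.Conv2d(dim, hidden, kernel_size=1)
        if mixer == "conv":
            # IRB-like spatial mixing: 3x3 depthwise convolution
            self.mix = nn.Conv2d(hidden, hidden, kernel_size=3, padding=1, groups=hidden)
        elif mixer == "attn":
            # Attention-based mixing over flattened spatial tokens
            self.mix = nn.MultiheadAttention(hidden, num_heads, batch_first=True)
        else:
            raise ValueError(f"unknown mixer: {mixer}")
        self.act = nn.GELU()
        self.project = nn.Conv2d(hidden, dim, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W)
        shortcut = x
        x = self.expand(self.norm(x))
        if self.mixer == "conv":
            x = self.mix(x)
        else:
            b, c, h, w = x.shape
            tokens = x.flatten(2).transpose(1, 2)          # (B, H*W, C)
            tokens, _ = self.mix(tokens, tokens, tokens)   # self-attention
            x = tokens.transpose(1, 2).reshape(b, c, h, w)
        x = self.project(self.act(x))
        return shortcut + x  # single residual connection


if __name__ == "__main__":
    block = MetaMobileBlock(dim=64, mixer="attn")
    out = block(torch.randn(2, 64, 14, 14))
    print(out.shape)  # torch.Size([2, 64, 14, 14])
```

The point of the sketch is the shared template: swapping the mixer from a depthwise convolution to attention turns a CNN-style inverted residual block into an attention-based one without changing the surrounding expand/project/residual structure.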