EMOv2: Pushing 5M Vision Model Frontier

Jiangning Zhang, Teng Hu, Haoyang He, Zhucun Xue, Yabiao Wang, Chengjie Wang, Yong Liu, Xiangtai Li, Dacheng Tao
{"title":"EMOv2: Pushing 5M Vision Model Frontier","authors":"Jiangning Zhang;Teng Hu;Haoyang He;Zhucun Xue;Yabiao Wang;Chengjie Wang;Yong Liu;Xiangtai Li;Dacheng Tao","doi":"10.1109/TPAMI.2025.3596776","DOIUrl":null,"url":null,"abstract":"This work focuses on developing parameter-efficient and lightweight models for dense predictions while trading off parameters, FLOPs, and performance. Our goal is to set up the new frontier of the 5 M magnitude lightweight model on various downstream tasks. Inverted Residual Block (IRB) serves as the infrastructure for lightweight CNNs, but no counterparts have been recognized by attention-based design. Our work rethinks the lightweight infrastructure of efficient IRB and practical components in Transformer from a unified perspective, extending CNN-based IRB to attention-based models and abstracting a one-residual Meta Mobile Block (MMBlock) for lightweight model design. Following neat but effective design criterion, we deduce a modern <b>I</b>mproved <b>I</b>nverted <b>R</b>esidual <b>M</b>obile <b>B</b>lock (<b>i<inline-formula><tex-math>$^{2}$</tex-math><alternatives><mml:math><mml:msup><mml:mrow/><mml:mn>2</mml:mn></mml:msup></mml:math><inline-graphic></alternatives></inline-formula>RMB</b>) and improve a hierarchical Efficient MOdel (<b>EMOv2</b>) with no elaborate complex structures. Considering the imperceptible latency for mobile users when downloading models under 4 G/5 G bandwidth and ensuring model performance, we investigate the performance upper limit of lightweight models with a magnitude of 5 M. Extensive experiments on various vision recognition, dense prediction, and image generation tasks demonstrate the superiority of our EMOv2 over state-of-the-art methods, e.g., EMOv2-1 M/2M/5 M achieve 72.3, 75.8, and 79.4 Top-1 that surpass equal-order CNN-/Attention-based models significantly. At the same time, EMOv2-5 M equipped RetinaNet achieves 41.5 mAP for object detection tasks that surpasses the previous EMO-5 M by +2.6<inline-formula><tex-math>$\\uparrow$</tex-math></inline-formula> . When employing the more robust training recipe, our EMOv2-5M eventually achieves 82.9 Top-1 accuracy, which elevates the performance of 5M magnitude models to a new level.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 11","pages":"10560-10576"},"PeriodicalIF":18.6000,"publicationDate":"2025-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on pattern analysis and machine intelligence","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/11119331/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

This work focuses on developing parameter-efficient and lightweight models for dense prediction while trading off parameters, FLOPs, and performance. Our goal is to establish a new frontier for 5M-magnitude lightweight models on various downstream tasks. The Inverted Residual Block (IRB) serves as the infrastructure of lightweight CNNs, but no counterpart has been recognized for attention-based designs. Our work rethinks the lightweight infrastructure of the efficient IRB and the practical components of the Transformer from a unified perspective, extending the CNN-based IRB to attention-based models and abstracting a one-residual Meta Mobile Block (MMBlock) for lightweight model design. Following a neat but effective design criterion, we deduce a modern Improved Inverted Residual Mobile Block (i$^{2}$RMB) and improve a hierarchical Efficient MOdel (EMOv2) with no elaborate complex structures. Considering the imperceptible latency for mobile users downloading models under 4G/5G bandwidth, and to ensure model performance, we investigate the performance upper limit of lightweight models at the 5M magnitude. Extensive experiments on various vision recognition, dense prediction, and image generation tasks demonstrate the superiority of our EMOv2 over state-of-the-art methods, e.g., EMOv2-1M/2M/5M achieve 72.3, 75.8, and 79.4 Top-1 accuracy, significantly surpassing equal-order CNN- and attention-based models. Meanwhile, RetinaNet equipped with EMOv2-5M achieves 41.5 mAP on object detection, surpassing the previous EMO-5M by +2.6. With a more robust training recipe, our EMOv2-5M eventually achieves 82.9 Top-1 accuracy, elevating the performance of 5M-magnitude models to a new level.
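To make the MMBlock abstraction concrete, below is a minimal PyTorch sketch of a one-residual block that expands channels, applies a swappable token mixer (a depth-wise convolution for the CNN-style IRB, or self-attention for an attention-based variant in the spirit of i$^{2}$RMB), and projects back before adding a single residual. The class name `MetaMobileBlock`, the `use_attention` switch, and all hyperparameters are illustrative assumptions for exposition, not the authors' released EMOv2 implementation.

```python
# Hedged sketch of the one-residual Meta Mobile Block idea from the abstract:
# expand -> token mixer -> project, with a single residual around the whole
# pipeline. The mixer is swappable: depth-wise conv (CNN-style IRB) or
# self-attention (attention-based variant). Names are illustrative only.
import torch
import torch.nn as nn


class MetaMobileBlock(nn.Module):
    def __init__(self, dim: int, expand_ratio: int = 4,
                 use_attention: bool = False, num_heads: int = 4):
        super().__init__()
        hidden = dim * expand_ratio
        self.expand = nn.Sequential(nn.Conv2d(dim, hidden, 1), nn.GELU())
        self.use_attention = use_attention
        if use_attention:
            # Attention token mixer (one way to read the i2RMB variant).
            self.mixer = nn.MultiheadAttention(hidden, num_heads,
                                               batch_first=True)
        else:
            # Depth-wise conv token mixer (classic IRB behavior).
            self.mixer = nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden)
        self.project = nn.Conv2d(hidden, dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        residual = x  # the single ("one-residual") skip connection
        x = self.expand(x)
        if self.use_attention:
            b, c, h, w = x.shape
            seq = x.flatten(2).transpose(1, 2)   # (B, H*W, C) token sequence
            seq, _ = self.mixer(seq, seq, seq)
            x = seq.transpose(1, 2).reshape(b, c, h, w)
        else:
            x = self.mixer(x)
        return residual + self.project(x)


# Usage: the same parameterization covers both a CNN-style IRB and an
# attention-based counterpart.
blk = MetaMobileBlock(dim=32, use_attention=True)
out = blk(torch.randn(1, 32, 14, 14))
print(out.shape)  # torch.Size([1, 32, 14, 14])
```

The point of the sketch is the unification claimed in the abstract: keeping a single residual around the whole expand-mix-project pipeline lets one block template instantiate either the classic IRB or its attention-based counterpart by swapping only the token mixer.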