EMOv2: Pushing 5M Vision Model Frontier
Jiangning Zhang; Teng Hu; Haoyang He; Zhucun Xue; Yabiao Wang; Chengjie Wang; Yong Liu; Xiangtai Li; Dacheng Tao
IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 47, no. 11, pp. 10560-10576, 2025. DOI: 10.1109/TPAMI.2025.3596776
Abstract
This work focuses on developing parameter-efficient and lightweight models for dense prediction while trading off parameters, FLOPs, and performance. Our goal is to set up the new frontier of 5M-magnitude lightweight models on various downstream tasks. The Inverted Residual Block (IRB) serves as the infrastructure for lightweight CNNs, but no counterpart has been recognized in attention-based designs. Our work rethinks the lightweight infrastructure of the efficient IRB and the practical components of the Transformer from a unified perspective, extending the CNN-based IRB to attention-based models and abstracting a one-residual Meta Mobile Block (MMBlock) for lightweight model design. Following neat but effective design criteria, we deduce a modern Improved Inverted Residual Mobile Block (i$^{2}$RMB) and improve a hierarchical Efficient MOdel (EMOv2) with no elaborately complex structures. Considering the imperceptible latency for mobile users when downloading models under 4G/5G bandwidth, and ensuring model performance, we investigate the performance upper limit of lightweight models at the 5M magnitude. Extensive experiments on various vision recognition, dense prediction, and image generation tasks demonstrate the superiority of our EMOv2 over state-of-the-art methods; e.g., EMOv2-1M/2M/5M achieve 72.3, 75.8, and 79.4 Top-1 accuracy, significantly surpassing equal-order CNN- and attention-based models. Meanwhile, a RetinaNet equipped with EMOv2-5M achieves 41.5 mAP on object detection, surpassing the previous EMO-5M by +2.6. When employing a more robust training recipe, our EMOv2-5M eventually achieves 82.9 Top-1 accuracy, which elevates the performance of 5M-magnitude models to a new level.
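To make the one-residual Meta Mobile Block abstraction concrete, below is a minimal PyTorch sketch of a block that expands channels, applies a pluggable spatial mixer (a depthwise convolution for an IRB-like instantiation, or self-attention for an attention-based one), and projects back with a single residual connection. The class name, hyper-parameters, and layer choices here are illustrative assumptions for exposition only; they are not the paper's exact i$^{2}$RMB design, whose details (e.g., its specific attention formulation) are given in the full text.

```python
import torch
import torch.nn as nn


class MetaMobileBlock(nn.Module):
    """Illustrative sketch of a one-residual Meta Mobile Block (assumed design).

    Expand -> spatial mixer (depthwise conv or self-attention) -> project,
    wrapped by a single residual connection.
    """

    def __init__(self, dim: int, expansion: int = 4, mixer: str = "conv", num_heads: int = 4):
        super().__init__()
        hidden = dim * expansion
        self.mixer = mixer
        self.norm = nn.BatchNorm2d(dim)
        self.expand = nn.Conv2d(dim, hidden, kernel_size=1)
        if mixer == "conv":
            # IRB-like spatial mixing: 3x3 depthwise convolution
            self.mix = nn.Conv2d(hidden, hidden, kernel_size=3, padding=1, groups=hidden)
        elif mixer == "attn":
            # Attention-based mixing over flattened spatial tokens
            self.mix = nn.MultiheadAttention(hidden, num_heads, batch_first=True)
        else:
            raise ValueError(f"unknown mixer: {mixer}")
        self.act = nn.GELU()
        self.project = nn.Conv2d(hidden, dim, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W)
        shortcut = x
        x = self.expand(self.norm(x))
        if self.mixer == "conv":
            x = self.mix(x)
        else:
            b, c, h, w = x.shape
            tokens = x.flatten(2).transpose(1, 2)          # (B, H*W, C)
            tokens, _ = self.mix(tokens, tokens, tokens)   # self-attention
            x = tokens.transpose(1, 2).reshape(b, c, h, w)
        x = self.project(self.act(x))
        return shortcut + x  # single residual connection


if __name__ == "__main__":
    block = MetaMobileBlock(dim=64, mixer="attn")
    out = block(torch.randn(2, 64, 14, 14))
    print(out.shape)  # torch.Size([2, 64, 14, 14])
```

The point of the sketch is the shared template: swapping the mixer from a depthwise convolution to attention turns a CNN-style inverted residual block into an attention-based one without changing the surrounding expand/project/residual structure.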