Efficient Mamba: Overcoming the visual limitations of Mamba with innovative structures

IF 4.2 | Zone 3, Computer Science | JCR Q2, Computer Science, Artificial Intelligence

Image and Vision Computing | Pub Date: 2025-06-12 | DOI: 10.1016/j.imavis.2025.105569

Wei Xu, Yi Wan, Dong Zhao, Long Zhang

Image and Vision Computing, Volume 161, Article 105569 (Journal Article). Full text: https://www.sciencedirect.com/science/article/pii/S026288562500157X

Citations: 0

Abstract

Mamba models have emerged as strong competitors to Transformers thanks to their efficient long-sequence processing and high memory efficiency. However, in vision tasks their state space models (SSMs) struggle to capture long-range dependencies, lack channel interactions, and generalize weakly.

To address these issues, we propose Efficient Mamba (EMB), an innovative framework that enhances SSMs while integrating convolutional neural networks (CNNs) and Transformers to mitigate their inherent drawbacks. The key contributions of EMB are as follows: (1) We introduce the TransSSM module, which incorporates feature flipping and channel shuffle to enhance channel interactions and improve generalization. Additionally, we propose the Window Spatial Attention (WSA) module for precise local feature modeling and Dual Pooling Attention (DPA) to improve global feature modeling and model stability. (2) We design the MFB-SCFB composite structure, which integrates TransSSM, WSA, Inverted Residual Blocks (IRBs), and convolutional attention modules to facilitate effective global–local feature interaction.
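
As a rough illustration of the two operations named for TransSSM, the sketch below shows a ShuffleNet-style channel shuffle and a simple sequence flip in plain Python. The abstract does not specify how either is implemented, so the function names, the grouping scheme, and the choice to flip along the token axis are all assumptions made for illustration, not the authors' code.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def channel_shuffle(channels: Sequence[T], groups: int) -> List[T]:
    """ShuffleNet-style shuffle (assumed variant, not taken from the paper).

    View the C channels as a (groups, C // groups) grid, transpose it to
    (C // groups, groups), and flatten, so that each output neighbourhood
    mixes channels from every group.
    """
    n = len(channels)
    if groups <= 0 or n % groups != 0:
        raise ValueError("channel count must be divisible by groups")
    per_group = n // groups
    return [
        channels[g * per_group + i]
        for i in range(per_group)
        for g in range(groups)
    ]


def flip_sequence(tokens: Sequence[T]) -> List[T]:
    """Reverse the token order (assumed meaning of 'feature flipping').

    A one-directional scan over the flipped sequence sees the tokens back
    to front, which is one common way to give an SSM two-sided context.
    """
    return list(reversed(tokens))


if __name__ == "__main__":
    print(channel_shuffle(list(range(6)), groups=2))  # [0, 3, 1, 4, 2, 5]
    print(flip_sequence(["t0", "t1", "t2"]))          # ['t2', 't1', 't0']
```

Both functions are pure permutations: they reorder channels or tokens without changing any values, so they add no parameters or multiply-accumulate operations.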

EMB achieves state-of-the-art (SOTA) performance across multiple vision tasks. For instance, on ImageNet classification, EMB-S/T/N achieve Top-1 accuracies of 78.9%, 76.3%, and 73.5%, with model sizes and FLOPs of 5.9M/1.5G, 2.5M/0.6G, and 1.4M/0.3G, respectively, when trained on a single NVIDIA 4090 GPU.
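
To make the slash-separated figures easier to compare, the short sketch below pairs each variant with its reported numbers and derives the accuracy gain and cost ratios between two variants. The numbers are copied from the abstract, reading "M" as millions of parameters and "G" as GFLOPs (an assumption). The dictionary layout and the `compare` helper are hypothetical.

```python
# Figures copied from the abstract; the structure and helper are illustrative only.
variants = {
    "EMB-S": {"top1": 78.9, "params_m": 5.9, "flops_g": 1.5},
    "EMB-T": {"top1": 76.3, "params_m": 2.5, "flops_g": 0.6},
    "EMB-N": {"top1": 73.5, "params_m": 1.4, "flops_g": 0.3},
}


def compare(small: str, large: str) -> dict:
    """Return the Top-1 gain (percentage points) and cost ratios of large vs. small."""
    a, b = variants[small], variants[large]
    return {
        "top1_gain": round(b["top1"] - a["top1"], 1),
        "params_ratio": round(b["params_m"] / a["params_m"], 2),
        "flops_ratio": round(b["flops_g"] / a["flops_g"], 2),
    }


if __name__ == "__main__":
    # {'top1_gain': 5.4, 'params_ratio': 4.21, 'flops_ratio': 5.0}
    print(compare("EMB-N", "EMB-S"))
```

Read this way, moving from EMB-N to EMB-S buys 5.4 points of Top-1 accuracy for roughly 4.2 times the parameters and 5 times the FLOPs.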

Experimental results demonstrate that EMB provides a novel paradigm for efficient vision model design, offering valuable insights for future SSM research.

Code: https://github.com/Xuwei86/EMB/tree/main.


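Source journal

Image and Vision Computing (Engineering & Technology: Engineering, Electrical & Electronic)

CiteScore: 8.50

Self-citation rate: 8.50%

Articles published: 143

Review time: 7.8 months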

About the journal: Image and Vision Computing has as a primary aim the provision of an effective medium of interchange for the results of high quality theoretical and applied research fundamental to all aspects of image interpretation and computer vision. The journal publishes work that proposes new image interpretation and computer vision methodology or addresses the application of such methods to real world scenes. It seeks to strengthen a deeper understanding in the discipline by encouraging the quantitative comparison and performance evaluation of the proposed methodology. The coverage includes: image interpretation, scene modelling, object recognition and tracking, shape analysis, monitoring and surveillance, active vision and robotic systems, SLAM, biologically-inspired computer vision, motion analysis, stereo vision, document image understanding, character and handwritten text recognition, face and gesture recognition, biometrics, vision-based human-computer interaction, human activity and behavior understanding, data fusion from multiple sensor inputs, image databases.