Distillation-free Scaling of Large SSMs for Images and Videos

Hamid Suleman, Syed Talal Wasim, Muzammal Naseer, Juergen Gall
{"title":"Distillation-free Scaling of Large SSMs for Images and Videos","authors":"Hamid Suleman, Syed Talal Wasim, Muzammal Naseer, Juergen Gall","doi":"arxiv-2409.11867","DOIUrl":null,"url":null,"abstract":"State-space models (SSMs), exemplified by S4, have introduced a novel context\nmodeling method by integrating state-space techniques into deep learning.\nHowever, they struggle with global context modeling due to their\ndata-independent matrices. The Mamba model addressed this with data-dependent\nvariants via the S6 selective-scan algorithm, enhancing context modeling,\nespecially for long sequences. However, Mamba-based architectures are difficult\nto scale with respect to the number of parameters, which is a major limitation\nfor vision applications. This paper addresses the scalability issue of large\nSSMs for image classification and action recognition without requiring\nadditional techniques like knowledge distillation. We analyze the distinct\ncharacteristics of Mamba-based and Attention-based models, proposing a\nMamba-Attention interleaved architecture that enhances scalability, robustness,\nand performance. We demonstrate that the stable and efficient interleaved\narchitecture resolves the scalability issue of Mamba-based architectures for\nimages and videos and increases robustness to common artifacts like JPEG\ncompression. Our thorough evaluation on the ImageNet-1K, Kinetics-400 and\nSomething-Something-v2 benchmarks demonstrates that our approach improves the\naccuracy of state-of-the-art Mamba-based architectures by up to $+1.7$.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Computer Vision and Pattern Recognition","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.11867","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

State-space models (SSMs), exemplified by S4, have introduced a novel context modeling method by integrating state-space techniques into deep learning. However, they struggle with global context modeling due to their data-independent matrices. The Mamba model addressed this with data-dependent variants via the S6 selective-scan algorithm, enhancing context modeling, especially for long sequences. However, Mamba-based architectures are difficult to scale with respect to the number of parameters, which is a major limitation for vision applications. This paper addresses the scalability issue of large SSMs for image classification and action recognition without requiring additional techniques like knowledge distillation. We analyze the distinct characteristics of Mamba-based and Attention-based models, proposing a Mamba-Attention interleaved architecture that enhances scalability, robustness, and performance. We demonstrate that the stable and efficient interleaved architecture resolves the scalability issue of Mamba-based architectures for images and videos and increases robustness to common artifacts like JPEG compression. Our thorough evaluation on the ImageNet-1K, Kinetics-400 and Something-Something-v2 benchmarks demonstrates that our approach improves the accuracy of state-of-the-art Mamba-based architectures by up to $+1.7$.
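
The core design is a backbone that interleaves Mamba-style state-space blocks with attention blocks through the network depth. The abstract does not give layer-level details, so the following is a minimal illustrative sketch, not the authors' implementation: MambaBlock is a hypothetical placeholder that uses a depthwise convolution as a stand-in token mixer (a real model would use a selective-scan S6 layer), while AttentionBlock is standard pre-norm multi-head self-attention; the two simply alternate over patch tokens.

# Illustrative sketch of an interleaved Mamba-Attention stack (assumptions noted above).
import torch
import torch.nn as nn


class MambaBlock(nn.Module):
    """Placeholder for a selective-scan (S6) block; a depthwise conv stands in for the SSM mixer."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.mixer = nn.Conv1d(dim, dim, kernel_size=3, padding=1, groups=dim)

    def forward(self, x):                        # x: (batch, tokens, dim)
        h = self.norm(x).transpose(1, 2)         # mix along the token axis
        return x + self.mixer(h).transpose(1, 2) # residual connection


class AttentionBlock(nn.Module):
    """Standard pre-norm multi-head self-attention block."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        h = self.norm(x)
        out, _ = self.attn(h, h, h, need_weights=False)
        return x + out


class InterleavedBackbone(nn.Module):
    """Alternate SSM-style and attention blocks through the depth of the network."""
    def __init__(self, dim=384, depth=12):
        super().__init__()
        self.blocks = nn.ModuleList(
            [MambaBlock(dim) if i % 2 == 0 else AttentionBlock(dim) for i in range(depth)]
        )

    def forward(self, x):
        for blk in self.blocks:
            x = blk(x)
        return x


tokens = torch.randn(2, 196, 384)            # e.g. 14x14 patch tokens from a 224px image
print(InterleavedBackbone()(tokens).shape)   # torch.Size([2, 196, 384])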