Distillation-free Scaling of Large SSMs for Images and Videos
{"title":"图像和视频大型 SSM 的无蒸馏缩放","authors":"Hamid Suleman, Syed Talal Wasim, Muzammal Naseer, Juergen Gall","doi":"arxiv-2409.11867","DOIUrl":null,"url":null,"abstract":"State-space models (SSMs), exemplified by S4, have introduced a novel context\nmodeling method by integrating state-space techniques into deep learning.\nHowever, they struggle with global context modeling due to their\ndata-independent matrices. The Mamba model addressed this with data-dependent\nvariants via the S6 selective-scan algorithm, enhancing context modeling,\nespecially for long sequences. However, Mamba-based architectures are difficult\nto scale with respect to the number of parameters, which is a major limitation\nfor vision applications. This paper addresses the scalability issue of large\nSSMs for image classification and action recognition without requiring\nadditional techniques like knowledge distillation. We analyze the distinct\ncharacteristics of Mamba-based and Attention-based models, proposing a\nMamba-Attention interleaved architecture that enhances scalability, robustness,\nand performance. We demonstrate that the stable and efficient interleaved\narchitecture resolves the scalability issue of Mamba-based architectures for\nimages and videos and increases robustness to common artifacts like JPEG\ncompression. Our thorough evaluation on the ImageNet-1K, Kinetics-400 and\nSomething-Something-v2 benchmarks demonstrates that our approach improves the\naccuracy of state-of-the-art Mamba-based architectures by up to $+1.7$.","PeriodicalId":501130,"journal":{"name":"arXiv - CS - Computer Vision and Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Distillation-free Scaling of Large SSMs for Images and Videos\",\"authors\":\"Hamid Suleman, Syed Talal Wasim, Muzammal Naseer, Juergen Gall\",\"doi\":\"arxiv-2409.11867\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"State-space models (SSMs), exemplified by S4, have introduced a novel context\\nmodeling method by integrating state-space techniques into deep learning.\\nHowever, they struggle with global context modeling due to their\\ndata-independent matrices. The Mamba model addressed this with data-dependent\\nvariants via the S6 selective-scan algorithm, enhancing context modeling,\\nespecially for long sequences. However, Mamba-based architectures are difficult\\nto scale with respect to the number of parameters, which is a major limitation\\nfor vision applications. This paper addresses the scalability issue of large\\nSSMs for image classification and action recognition without requiring\\nadditional techniques like knowledge distillation. We analyze the distinct\\ncharacteristics of Mamba-based and Attention-based models, proposing a\\nMamba-Attention interleaved architecture that enhances scalability, robustness,\\nand performance. We demonstrate that the stable and efficient interleaved\\narchitecture resolves the scalability issue of Mamba-based architectures for\\nimages and videos and increases robustness to common artifacts like JPEG\\ncompression. 
Our thorough evaluation on the ImageNet-1K, Kinetics-400 and\\nSomething-Something-v2 benchmarks demonstrates that our approach improves the\\naccuracy of state-of-the-art Mamba-based architectures by up to $+1.7$.\",\"PeriodicalId\":501130,\"journal\":{\"name\":\"arXiv - CS - Computer Vision and Pattern Recognition\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-09-18\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Computer Vision and Pattern Recognition\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2409.11867\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Computer Vision and Pattern Recognition","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.11867","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Hamid Suleman, Syed Talal Wasim, Muzammal Naseer, Juergen Gall
State-space models (SSMs), exemplified by S4, introduced a novel approach to context modeling by integrating state-space techniques into deep learning. However, they struggle to model global context because their state-space matrices are data-independent. The Mamba model addresses this with data-dependent variants via the S6 selective-scan algorithm, improving context modeling, especially for long sequences.
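For background, the standard discretized SSM recurrence from the S4/Mamba literature (not an equation stated in this abstract) is sketched below; S4 keeps the discretized parameters fixed across all tokens, whereas Mamba's S6 makes the step size $\Delta_t$ and the projections $B_t, C_t$ functions of the input token $x_t$, so the scan adapts to content:

```latex
% Discretized state-space recurrence. In S4, (\Delta, B, C) are
% input-independent per layer; in Mamba's S6 they depend on x_t.
\begin{aligned}
h_t &= \bar{A}_t\, h_{t-1} + \bar{B}_t\, x_t, \\
y_t &= C_t\, h_t,
\end{aligned}
\quad \text{with } \bar{A}_t = \exp(\Delta_t A),\;
\bar{B}_t \approx \Delta_t B_t,\;
\Delta_t, B_t, C_t = f_\Delta(x_t),\, f_B(x_t),\, f_C(x_t).
```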
However, Mamba-based architectures are difficult to scale with respect to the number of parameters, which is a major limitation for vision applications. This paper addresses the scalability of large SSMs for image classification and action recognition without requiring additional techniques such as knowledge distillation. We analyze the distinct characteristics of Mamba-based and Attention-based models and propose a Mamba-Attention interleaved architecture that enhances scalability, robustness, and performance.
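To make the interleaving idea concrete, here is a minimal PyTorch sketch assuming a simple alternation pattern. `MambaBlock` is a placeholder (a real model would use an S6 selective-scan mixer such as `mamba_ssm.Mamba`), and the block ratio, ordering, and internals are illustrative assumptions, not the paper's exact architecture:

```python
# Hypothetical Mamba-Attention interleaved stack (illustrative sketch).
import torch
import torch.nn as nn

class AttentionBlock(nn.Module):
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm(x)
        out, _ = self.attn(h, h, h)       # global token mixing
        return x + out                    # residual connection

class MambaBlock(nn.Module):
    """Stand-in for an S6 selective-scan block; a linear layer here."""
    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.mixer = nn.Linear(dim, dim)  # placeholder for the selective scan

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.mixer(self.norm(x))

class InterleavedStack(nn.Module):
    def __init__(self, dim: int, depth: int, attn_every: int = 2):
        super().__init__()
        # Insert an attention block after every `attn_every` Mamba blocks,
        # e.g. depth=6, attn_every=2 gives the pattern M M A M M A.
        self.blocks = nn.ModuleList(
            AttentionBlock(dim) if (i + 1) % (attn_every + 1) == 0
            else MambaBlock(dim)
            for i in range(depth)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for blk in self.blocks:
            x = blk(x)
        return x

tokens = torch.randn(2, 196, 384)         # (batch, patches, dim)
print(InterleavedStack(384, depth=6)(tokens).shape)
```

The `attn_every` ratio is a free knob in this sketch; the abstract does not specify how many Mamba blocks sit between attention blocks in the actual architecture.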
We demonstrate that this stable and efficient interleaved architecture resolves the scalability issue of Mamba-based architectures for images and videos and increases robustness to common artifacts such as JPEG compression. Our thorough evaluation on the ImageNet-1K, Kinetics-400, and Something-Something-v2 benchmarks demonstrates that our approach improves the accuracy of state-of-the-art Mamba-based architectures by up to $+1.7$.