{"title":"MLMamba: A Mamba-Based Efficient Network for Multi-Label Remote Sensing Scene Classification","authors":"Ruiqi Du;Xu Tang;Jingjing Ma;Xiangrong Zhang;Licheng Jiao","doi":"10.1109/TCSVT.2025.3535939","DOIUrl":null,"url":null,"abstract":"As a useful remote sensing (RS) scene interpretation technique, multi-label RS scene classification (RSSC) always attracts researchers’ attention and plays an important role in the RS community. To assign multiple semantic labels to a single RS image according to its complex contents, the existing methods focus on learning the valuable visual features and mining the latent semantic relationships from the RS images. This is a feasible and helpful solution. However, they are often associated with high computational costs due to the widespread use of Transformers. To alleviate this problem, we propose a Mamba-based efficient network based on the newly emerged state space model called MLMamba. In addition to the basic feature extractor (convolutional neural network and language model) and classifier (multiple perceptrons), MLMamba consists of two key components: a pyramid Mamba and a feature-guided semantic modeling (FGSM) Mamba. Pyramid Mamba uses multi-scale scanning to establish global relationships within and across different scales, improving MLMamba’s ability to explore RS images. Under the guidance of the obtained visual features, FGSM Mamba establishes associations between different land covers. Combining these two components can deeply mine local features, multi-scale information, and long-range dependencies from RS images and build semantic relationships between different surface covers. These superiorities guarantee that MLMamba can fully understand the complex contents within RS images and accurately determine which categories exist. 
Furthermore, the simple and effective structure and linear computational complexity of the state space model ensure that pyramid Mamba and FGSM Mamba will not impose too much computational burden on MLMamba. Extensive experiments counted on three benchmark multi-label RSSC data sets validate the effectiveness of MLMamba. The positive results demonstrate that MLMamba achieves state-of-the-art performance, surpassing existing methods in accuracy, model size, and computational efficiency. Our source codes are available at <uri>https://github.com/TangXu-Group/ multilabelRSSC/tree/main/MLMamba</uri>.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 7","pages":"6245-6258"},"PeriodicalIF":11.1000,"publicationDate":"2025-01-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Circuits and Systems for Video Technology","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/10857393/","RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
Citations: 0
Abstract
As a useful remote sensing (RS) scene interpretation technique, multi-label RS scene classification (RSSC) has long attracted researchers’ attention and plays an important role in the RS community. To assign multiple semantic labels to a single RS image according to its complex contents, existing methods focus on learning valuable visual features and mining latent semantic relationships from RS images. This is a feasible and helpful solution, but it often incurs high computational costs due to the widespread use of Transformers. To alleviate this problem, we propose MLMamba, an efficient network built on the recently emerged Mamba state space model. In addition to a basic feature extractor (a convolutional neural network and a language model) and a classifier (multi-layer perceptrons), MLMamba comprises two key components: a pyramid Mamba and a feature-guided semantic modeling (FGSM) Mamba. The pyramid Mamba uses multi-scale scanning to establish global relationships within and across different scales, improving MLMamba’s ability to explore RS images. Guided by the obtained visual features, the FGSM Mamba establishes associations between different land covers. Together, these two components deeply mine local features, multi-scale information, and long-range dependencies from RS images and build semantic relationships between different surface covers, enabling MLMamba to fully understand the complex contents of RS images and accurately determine which categories are present. Furthermore, the simple yet effective structure and linear computational complexity of the state space model ensure that the pyramid Mamba and the FGSM Mamba impose little additional computational burden on MLMamba. Extensive experiments conducted on three benchmark multi-label RSSC data sets validate the effectiveness of MLMamba.
The positive results demonstrate that MLMamba achieves state-of-the-art performance, surpassing existing methods in accuracy, model size, and computational efficiency. Our source code is available at https://github.com/TangXu-Group/multilabelRSSC/tree/main/MLMamba.
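The abstract's efficiency claim rests on the linear computational complexity of the state space model: unlike self-attention, which compares all token pairs (quadratic in sequence length), an SSM processes the sequence in a single recurrent scan. The sketch below illustrates this recurrence in a minimal, scalar-input form; the function name and the fixed (non-selective) matrices are illustrative assumptions, not the paper's actual MLMamba implementation.

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Minimal linear-time state space model recurrence (illustrative):
        h_t = A @ h_{t-1} + B * x_t    (state update)
        y_t = C @ h_t                  (readout)
    One pass over the sequence costs O(L) in sequence length L,
    versus the O(L^2) pairwise interactions of self-attention."""
    d = A.shape[0]              # hidden state dimension
    h = np.zeros(d)             # initial state h_0 = 0
    y = np.empty(len(x))
    for t in range(len(x)):
        h = A @ h + B * x[t]    # fold the new input into the state
        y[t] = C @ h            # project the state to an output
    return y
```

With A set to zero the state carries no history, so each output reduces to (C . B) * x_t; choosing B as ones and C as halves in a 2-dimensional state then reproduces the input, which makes the recurrence easy to sanity-check. Mamba additionally makes A, B, and C input-dependent ("selective"), but the linear scan structure is the same.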
About the journal:
The IEEE Transactions on Circuits and Systems for Video Technology (TCSVT) is dedicated to covering all aspects of video technologies from a circuits and systems perspective. We encourage submissions of general, theoretical, and application-oriented papers related to image and video acquisition, representation, presentation, and display. Additionally, we welcome contributions in areas such as processing, filtering, and transforms; analysis and synthesis; learning and understanding; compression, transmission, communication, and networking; as well as storage, retrieval, indexing, and search. Furthermore, papers focusing on hardware and software design and implementation are highly valued. Join us in advancing the field of video technology through innovative research and insights.