ViCxLSTM: An extended Long Short-term Memory vision transformer for complex remote sensing scene classification
Swalpa Kumar Roy, Ali Jamali, Koushik Biswas, Danfeng Hong, Pedram Ghamisi
International Journal of Applied Earth Observation and Geoinformation, vol. 143, Article 104801, published 2025-09-01. DOI: 10.1016/j.jag.2025.104801. URL: https://www.sciencedirect.com/science/article/pii/S1569843225004480
Scene classification plays a critical role in remote sensing image analysis, and numerous methods based on Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) have been developed to improve performance on high-resolution remote sensing (HRRS) imagery. However, existing models struggle with several key challenges, including capturing fine-grained local features and modeling long-range spatial dependencies in complex scenes. These limitations reduce the discriminative power of the extracted features, and discriminative features are critical for HRRS image classification. To overcome these issues, our study aims to design a unified model that jointly leverages local information extraction, global context modeling, and long-range dependency learning. We propose a novel architecture, ViCxLSTM, designed to enhance feature discriminability for HRRS scene classification. ViCxLSTM is a hybrid model that integrates a Local Pattern Unit (comprising convolutional layers and Fourier Transforms), an extended Long Short-Term Memory module (xLSTM), and a Vision Transformer. This integrated architecture enables the model to capture a wide range of spatial patterns, from local textures to long-range dependencies and global contextual relationships. Experimental evaluations show that ViCxLSTM achieves superior classification performance across diverse land use datasets, outperforming several state-of-the-art models, including ResNet-50, ResNet-101, ResNet-152, ViT, LeViT, CrossViT, DeepViT, and CaiT. The code will be made freely available at https://github.com/aj1365/ViCxLSTM.
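To make the idea of a Local Pattern Unit that pairs convolution with a Fourier Transform more concrete, the following is a minimal standard-library Python sketch. It is not the authors' implementation: the abstract does not say how the two branches are combined, so the side-by-side concatenation, the function names (`conv2d_same`, `dft2_magnitude`, `local_pattern_features`), and the fixed averaging kernel are all assumptions made purely for illustration. The xLSTM and Vision Transformer stages are not sketched here.

```python
import cmath


def conv2d_same(image, kernel):
    """Zero-padded 2D cross-correlation that keeps the input size."""
    h, w = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    ph, pw = kh // 2, kw // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for a in range(kh):
                for b in range(kw):
                    y, x = i + a - ph, j + b - pw
                    if 0 <= y < h and 0 <= x < w:
                        acc += image[y][x] * kernel[a][b]
            out[i][j] = acc
    return out


def dft2_magnitude(image):
    """Magnitude of the naive 2D discrete Fourier transform (O(n^4))."""
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for u in range(h):
        for v in range(w):
            total = 0j
            for y in range(h):
                for x in range(w):
                    angle = -2j * cmath.pi * (u * y / h + v * x / w)
                    total += image[y][x] * cmath.exp(angle)
            out[u][v] = abs(total)
    return out


def local_pattern_features(image, kernel):
    """Hypothetical fusion: spatial and spectral maps placed side by side.

    Assumption for illustration only; the paper may fuse them differently.
    """
    spatial = conv2d_same(image, kernel)
    spectral = dft2_magnitude(image)
    return [s_row + f_row for s_row, f_row in zip(spatial, spectral)]


if __name__ == "__main__":
    patch = [[1.0] * 4 for _ in range(4)]          # toy 4x4 image patch
    mean_kernel = [[1.0 / 9.0] * 3 for _ in range(3)]
    features = local_pattern_features(patch, mean_kernel)
    print(len(features), len(features[0]))
```

In a real model the kernel weights would be learned and a fast FFT routine from a numerical library would replace the naive transform; the sketch only shows that the convolution branch responds to local neighborhoods while the Fourier branch summarizes the whole patch in the frequency domain.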
About the journal:
The International Journal of Applied Earth Observation and Geoinformation publishes original papers that utilize earth observation data for natural resource and environmental inventory and management. These data primarily originate from remote sensing platforms, including satellites and aircraft, supplemented by surface and subsurface measurements. Addressing natural resources such as forests, agricultural land, soils, and water, as well as environmental concerns like biodiversity, land degradation, and hazards, the journal explores conceptual and data-driven approaches. It covers geoinformation themes like capturing, databasing, visualization, interpretation, data quality, and spatial uncertainty.