SimPoolFormer: A two-stream vision transformer for hyperspectral image classification

Impact factor: 3.8, JCR Q2 (Environmental Sciences)

Remote Sensing Applications-Society and Environment. Pub Date: 2025-01-01. DOI: 10.1016/j.rsase.2025.101478

Swalpa Kumar Roy, Ali Jamali, Jocelyn Chanussot, Pedram Ghamisi, Ebrahim Ghaderpour, Himan Shahabi

Volume 37, Article 101478. Full text: https://www.sciencedirect.com/science/article/pii/S235293852500031X

Citations: 0

Abstract

Vision transformers (ViTs) have reshaped vision research through their ability to model global dependencies accurately. However, their drawbacks, including high computational cost, dependence on large labeled datasets, and limited capacity to capture essential local features, have motivated the search for more efficient alternatives. Vision multilayer perceptron (MLP) architectures, for their part, have shown excellent capability in image classification, performing on par with or even better than widely used state-of-the-art ViTs and convolutional neural networks (CNNs). Vision MLPs have linear computational complexity, require less training data, and can capture long-range dependencies through mechanisms similar to those of transformers at much lower computational cost. This paper therefore develops a novel deep learning architecture, SimPoolFormer, to address these shortcomings of vision transformers. SimPoolFormer is a two-stream attention-in-attention vision transformer architecture built on two computationally efficient networks. It replaces the computationally intensive multi-headed self-attention in ViT with SimPool for efficiency, while a second stream adopts ResMLP, leveraging its linear attention-based design, to enhance hyperspectral image (HSI) classification. Results show that SimPoolFormer significantly outperforms several other deep learning models, including 1D-CNN, 2D-CNN, RNN, VGG-16, EfficientNet, ResNet-50, and ViT, on three complex HSI datasets: QUH-Tangdaowan, QUH-Qingyun, and QUH-Pingan. For example, on the QUH-Qingyun dataset, SimPoolFormer improved average accuracy over 2D-CNN, VGG-16, EfficientNet, ViT, ResNet-50, RNN, and 1D-CNN by 0.98%, 3.81%, 4.16%, 7.94%, 9.45%, 12.25%, and 13.95%, respectively.
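The two-stream idea in the abstract can be illustrated with a toy, dependency-free sketch. This is not the authors' implementation: the abstract does not give layer definitions, so the single-query pooling in `attention_pool`, the residual cross-token layer in `token_mixing`, and the additive fusion in `two_stream_features` are all assumptions made here for illustration, and all learned projections are omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(tokens):
    """Stream 1 (assumed form): pool N tokens of width d with ONE query.

    The query is the mean token, so scoring costs O(N * d) instead of the
    O(N^2 * d) of full self-attention. Learned projections are omitted.
    """
    n, d = len(tokens), len(tokens[0])
    query = [sum(t[j] for t in tokens) / n for j in range(d)]
    scores = [sum(q * x for q, x in zip(query, t)) / math.sqrt(d) for t in tokens]
    weights = softmax(scores)
    pooled = [sum(w * t[j] for w, t in zip(weights, tokens)) for j in range(d)]
    return pooled, weights


def token_mixing(tokens, mix):
    """Stream 2 (assumed form): linear cross-token mixing plus a residual.

    out[i] = tokens[i] + sum_k mix[i][k] * tokens[k], with `mix` an N x N matrix.
    """
    n, d = len(tokens), len(tokens[0])
    return [
        [tokens[i][j] + sum(mix[i][k] * tokens[k][j] for k in range(n)) for j in range(d)]
        for i in range(n)
    ]


def two_stream_features(tokens, mix):
    """Fuse both streams by addition (a hypothetical choice, not from the abstract)."""
    pooled, _ = attention_pool(tokens)
    mixed = token_mixing(tokens, mix)
    n = len(mixed)
    mean_mixed = [sum(row[j] for row in mixed) / n for j in range(len(pooled))]
    return [a + b for a, b in zip(pooled, mean_mixed)]


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 toy "patch" tokens, width 2
mix = [[0.0] * 3 for _ in range(3)]            # zero mixing: stream 2 is the identity
features = two_stream_features(tokens, mix)
print(features)
```

The point of the sketch is the cost argument: stream 1 scores each token once against a single query, and stream 2 mixes tokens with a fixed linear map, so neither stream builds the per-input N x N attention matrix that multi-headed self-attention requires.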

Source journal

Remote Sensing Applications-Society and Environment Multiple-

CiteScore

8.00

Self-citation rate

8.50%

Articles published

204

Review time

65 days

Journal description: The journal "Remote Sensing Applications: Society and Environment" (RSASE) focuses on remote sensing studies that address specific topics with an emphasis on environmental and societal issues: regional and local studies with global significance. Subjects are encouraged to take an interdisciplinary approach and include, but are not limited to:

- Global and climate change studies addressing the impact of increasing concentrations of greenhouse gases, CO2 emissions, carbon balance and carbon mitigation, and energy systems on social and environmental systems
- Ecological and environmental issues including biodiversity, ecosystem dynamics, land degradation, atmospheric and water pollution, urban footprint, ecosystem management and natural hazards (e.g. earthquakes, typhoons, floods, landslides)
- Natural resource studies including land use in general, biomass estimation, forests, agricultural land, plantations, soils, coral reefs, wetlands and water resources
- Agriculture, food production systems and food security outcomes
- Socio-economic issues including urban systems, urban growth, public health, epidemics, land-use transition and land-use conflicts
- Oceanography and coastal zone studies, including sea level rise projections, coastline changes and the ocean-land interface
- Regional challenges for remote sensing application techniques, monitoring and analysis, such as cloud screening and atmospheric correction for tropical regions
- Interdisciplinary studies combining remote sensing, household survey data, field measurements and models to address environmental, societal and sustainability issues
- Quantitative and qualitative analysis that documents the impact of using remote sensing studies in social, political, environmental or economic systems