{"title":"Hyperspectral image classification using hybrid convolutional-based cross-patch retentive network","authors":"Rajat Kumar Arya, Rohith Peddi, Rajeev Srivastava","doi":"10.1016/j.cviu.2025.104382","DOIUrl":null,"url":null,"abstract":"<div><div>Vision transformer (ViT) is a widely used method to capture long-distance dependencies and has demonstrated remarkable results in classifying hyperspectral images (HSIs). Nevertheless, the fundamental component of ViT, self-attention, has difficulty striking a balance between global modeling and high computational complexity across entire input sequences. Recently, the Retentive Network (RetNet) was developed to address this issue, claiming to be more scalable and efficient than standard transformers. However, RetNet struggles to capture local features such as traditional transformers. This paper proposes a RetNet-based novel hybrid convolutional-based cross-patch retentive network (HCCRN). The proposed HCCRN model comprises a hybrid convolutional-based feature extraction (HCFE) module, a weighted feature tokenization module, and a cross-patch retentive network (CRN) module. The HCFE architecture combines four 2D convolutional layers and residual connections with a 3D convolutional layer to extract high-level fused spatial–spectral information and capture low-level spectral features. This hybrid method solves the vanishing gradient issue and comprehensively represents intricate spatial–spectral interactions by enabling hierarchical learning of spectral context and spatial dependencies. To further maximize processing efficiency, the acquired spatial–spectral data are transformed into semantic tokens by the tokenization module, which feeds them into the CRN module. CRN enriches feature representations and increases accuracy by utilizing a multi-head cross-patch retention mechanism to capture numerous semantic relations between input tokens. Extensive experiments on three benchmark datasets have shown that the proposed HCCRN architecture significantly outperforms state-of-the-art methods. It reduces computation time and increases classification accuracy, demonstrating its generalizability and robustness in the HSIC task.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"257 ","pages":"Article 104382"},"PeriodicalIF":3.5000,"publicationDate":"2025-05-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computer Vision and Image Understanding","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1077314225001055","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
引用次数: 0
Abstract
Vision transformers (ViTs) are widely used to capture long-distance dependencies and have demonstrated remarkable results in classifying hyperspectral images (HSIs). Nevertheless, the fundamental component of ViT, self-attention, struggles to balance global modeling against the high computational cost it incurs over entire input sequences. The Retentive Network (RetNet) was recently developed to address this issue and claims to be more scalable and efficient than standard transformers. However, like traditional transformers, RetNet struggles to capture local features. This paper proposes a novel RetNet-based hybrid convolutional-based cross-patch retentive network (HCCRN). The proposed HCCRN model comprises a hybrid convolutional-based feature extraction (HCFE) module, a weighted feature tokenization module, and a cross-patch retentive network (CRN) module. The HCFE architecture combines four 2D convolutional layers and residual connections with a 3D convolutional layer, capturing low-level spectral features and extracting high-level fused spatial–spectral information. By enabling hierarchical learning of spectral context and spatial dependencies, this hybrid design mitigates the vanishing-gradient problem and comprehensively represents intricate spatial–spectral interactions. To improve processing efficiency, the tokenization module transforms the extracted spatial–spectral features into semantic tokens and feeds them into the CRN module. The CRN enriches feature representations and increases accuracy by using a multi-head cross-patch retention mechanism to capture the many semantic relations between input tokens. Extensive experiments on three benchmark datasets show that the proposed HCCRN architecture significantly outperforms state-of-the-art methods, reducing computation time while increasing classification accuracy and demonstrating its generalizability and robustness in the hyperspectral image classification (HSIC) task.
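The abstract describes the HCFE module as pairing a 3D convolution (for low-level spectral features) with four 2D convolutional layers and residual connections (for high-level fused spatial–spectral features). The following is a minimal, hypothetical PyTorch sketch of that idea; the kernel sizes, channel widths, and layer ordering are illustrative assumptions, not the authors' published configuration.

```python
# Minimal sketch (assumed configuration, not the paper's exact one) of the
# HCFE idea: a 3D convolution for low-level spectral features followed by
# four 2D convolutional blocks with residual connections for fused
# spatial-spectral features.
import torch
import torch.nn as nn

class HCFESketch(nn.Module):
    def __init__(self, bands: int = 30, channels: int = 64):
        super().__init__()
        # 3D convolution slides across the spectral axis of the HSI patch.
        self.conv3d = nn.Conv3d(1, 8, kernel_size=(7, 3, 3), padding=(3, 1, 1))
        # 1x1 projection folds the spectral axis into 2D feature channels.
        self.proj = nn.Conv2d(8 * bands, channels, kernel_size=1)
        # Four 2D convolutional blocks, each wrapped in a residual connection.
        self.blocks = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(channels, channels, kernel_size=3, padding=1),
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
            )
            for _ in range(4)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, bands, height, width) hyperspectral patch
        x = self.conv3d(x)                        # (batch, 8, bands, h, w)
        b, c, d, h, w = x.shape
        x = self.proj(x.reshape(b, c * d, h, w))  # (batch, channels, h, w)
        for block in self.blocks:
            x = x + block(x)                      # residuals ease gradient flow
        return x

# Example: two 13x13 patches with 30 spectral bands.
feats = HCFESketch()(torch.randn(2, 1, 30, 13, 13))  # -> (2, 64, 13, 13)
```

In the paper's pipeline, a feature map like this would then be flattened into semantic tokens by the weighted tokenization module before entering the CRN; that step and the cross-patch retention mechanism itself are not sketched here.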
Journal Introduction:
The central focus of this journal is the computer analysis of pictorial information. Computer Vision and Image Understanding publishes papers covering all aspects of image analysis from the low-level, iconic processes of early vision to the high-level, symbolic processes of recognition and interpretation. A wide range of topics in the image understanding area is covered, including papers offering insights that differ from predominant views.
Research Areas Include:
• Theory
• Early vision
• Data structures and representations
• Shape
• Range
• Motion
• Matching and recognition
• Architecture and languages
• Vision systems