{"title":"Hyperspectral image classification using hybrid convolutional-based cross-patch retentive network","authors":"Rajat Kumar Arya, Rohith Peddi, Rajeev Srivastava","doi":"10.1016/j.cviu.2025.104382","DOIUrl":null,"url":null,"abstract":"<div><div>Vision transformer (ViT) is a widely used method to capture long-distance dependencies and has demonstrated remarkable results in classifying hyperspectral images (HSIs). Nevertheless, the fundamental component of ViT, self-attention, has difficulty striking a balance between global modeling and high computational complexity across entire input sequences. Recently, the Retentive Network (RetNet) was developed to address this issue, claiming to be more scalable and efficient than standard transformers. However, RetNet struggles to capture local features such as traditional transformers. This paper proposes a RetNet-based novel hybrid convolutional-based cross-patch retentive network (HCCRN). The proposed HCCRN model comprises a hybrid convolutional-based feature extraction (HCFE) module, a weighted feature tokenization module, and a cross-patch retentive network (CRN) module. The HCFE architecture combines four 2D convolutional layers and residual connections with a 3D convolutional layer to extract high-level fused spatial–spectral information and capture low-level spectral features. This hybrid method solves the vanishing gradient issue and comprehensively represents intricate spatial–spectral interactions by enabling hierarchical learning of spectral context and spatial dependencies. To further maximize processing efficiency, the acquired spatial–spectral data are transformed into semantic tokens by the tokenization module, which feeds them into the CRN module. CRN enriches feature representations and increases accuracy by utilizing a multi-head cross-patch retention mechanism to capture numerous semantic relations between input tokens. Extensive experiments on three benchmark datasets have shown that the proposed HCCRN architecture significantly outperforms state-of-the-art methods. It reduces computation time and increases classification accuracy, demonstrating its generalizability and robustness in the HSIC task.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"257 ","pages":"Article 104382"},"PeriodicalIF":3.5000,"publicationDate":"2025-05-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computer Vision and Image Understanding","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1077314225001055","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
引用次数: 0
Abstract
Vision transformers (ViTs) are widely used to capture long-distance dependencies and have demonstrated remarkable results in classifying hyperspectral images (HSIs). Nevertheless, the fundamental component of ViT, self-attention, struggles to balance global modeling against the high computational cost it incurs over entire input sequences. The Retentive Network (RetNet) was recently developed to address this issue and claims to be more scalable and efficient than standard transformers. However, like traditional transformers, RetNet struggles to capture local features. This paper proposes a novel RetNet-based hybrid convolutional-based cross-patch retentive network (HCCRN). The proposed HCCRN model comprises a hybrid convolutional-based feature extraction (HCFE) module, a weighted feature tokenization module, and a cross-patch retentive network (CRN) module. The HCFE architecture combines four 2D convolutional layers and residual connections with a 3D convolutional layer, capturing low-level spectral features and extracting high-level fused spatial–spectral information. By enabling hierarchical learning of spectral context and spatial dependencies, this hybrid design mitigates the vanishing-gradient problem and comprehensively represents intricate spatial–spectral interactions. To improve processing efficiency, the tokenization module transforms the extracted spatial–spectral features into semantic tokens and feeds them into the CRN module. The CRN enriches feature representations and increases accuracy by using a multi-head cross-patch retention mechanism to capture the many semantic relations between input tokens. Extensive experiments on three benchmark datasets show that the proposed HCCRN architecture significantly outperforms state-of-the-art methods, reducing computation time while increasing classification accuracy and demonstrating its generalizability and robustness in the hyperspectral image classification (HSIC) task.
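The abstract describes the HCFE module as pairing a 3D convolution (for low-level spectral features) with four 2D convolutional layers and residual connections (for high-level fused spatial–spectral features). The following is a minimal, hypothetical PyTorch sketch of that idea; the kernel sizes, channel widths, and layer ordering are illustrative assumptions, not the authors' published configuration.

```python
# Minimal sketch (assumed configuration, not the paper's exact one) of the
# HCFE idea: a 3D convolution for low-level spectral features followed by
# four 2D convolutional blocks with residual connections for fused
# spatial-spectral features.
import torch
import torch.nn as nn

class HCFESketch(nn.Module):
    def __init__(self, bands: int = 30, channels: int = 64):
        super().__init__()
        # 3D convolution slides across the spectral axis of the HSI patch.
        self.conv3d = nn.Conv3d(1, 8, kernel_size=(7, 3, 3), padding=(3, 1, 1))
        # 1x1 projection folds the spectral axis into 2D feature channels.
        self.proj = nn.Conv2d(8 * bands, channels, kernel_size=1)
        # Four 2D convolutional blocks, each wrapped in a residual connection.
        self.blocks = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(channels, channels, kernel_size=3, padding=1),
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
            )
            for _ in range(4)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, bands, height, width) hyperspectral patch
        x = self.conv3d(x)                        # (batch, 8, bands, h, w)
        b, c, d, h, w = x.shape
        x = self.proj(x.reshape(b, c * d, h, w))  # (batch, channels, h, w)
        for block in self.blocks:
            x = x + block(x)                      # residuals ease gradient flow
        return x

# Example: two 13x13 patches with 30 spectral bands.
feats = HCFESketch()(torch.randn(2, 1, 30, 13, 13))  # -> (2, 64, 13, 13)
```

In the paper's pipeline, a feature map like this would then be flattened into semantic tokens by the weighted tokenization module before entering the CRN; that step and the cross-patch retention mechanism itself are not sketched here.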
Journal Introduction:
The central focus of this journal is the computer analysis of pictorial information. Computer Vision and Image Understanding publishes papers covering all aspects of image analysis from the low-level, iconic processes of early vision to the high-level, symbolic processes of recognition and interpretation. A wide range of topics in the image understanding area is covered, including papers offering insights that differ from predominant views.
Research Areas Include:
• Theory
• Early vision
• Data structures and representations
• Shape
• Range
• Motion
• Matching and recognition
• Architecture and languages
• Vision systems