Transforming Earth Observation: An Extensive Evaluation of Vision Transformers for Satellite Images-Based Land Cover Classification

IF 2.3 | Zone 4 (Computer Science) | JCR Q2: COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Expert Systems, vol. 42, no. 7 | Pub Date: 2025-06-10 | DOI: 10.1111/exsy.70082

Fakhri Alam Khan


Citations: 0

Abstract

Satellite imagery offers rich information for land cover classification, but choosing a feature extractor (backbone architecture) that is both effective and efficient remains challenging. In this study, I benchmark 25 vision transformers across 10 public land cover datasets to guide backbone selection for downstream classification tasks. The approach encodes each satellite image into a fixed-length feature vector with a pre-trained transformer, then trains and tests a linear support vector classifier on these encodings to isolate the impact of the backbone alone. I report classification accuracy and F1-score averaged over three random stratified splits per dataset, and I also measure training time to assess computational cost. Results show that encodings from large-receptive-field transformers with advanced self-attention, particularly deit3_base_patch16_224 and twins_svt_large, achieve the highest accuracies without incurring prohibitive training times. In contrast, encodings from the compact variants train faster but incur notable performance drops of around 7%–8%. These findings reveal a clear trade-off between representational power and efficiency. Practitioners can use these rankings to select a transformer backbone that best balances accuracy and computational efficiency for satellite image-based land cover classification, accelerating the development of robust and resource-aware systems.
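The evaluation protocol described in the abstract (frozen encodings, a linear support vector classifier, three stratified splits, averaged accuracy and F1, timed training) can be sketched roughly as follows. This is a minimal illustration, not the author's code: the function name `evaluate_encodings`, the 80/20 split ratio, the macro-averaged F1, the `max_iter` value, and the synthetic features standing in for transformer encodings are all assumptions, since the abstract does not specify them. The model names in the abstract resemble timm identifiers, but the abstract does not say which library produced the encodings, so feature extraction is left out here.

```python
import time

import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.svm import LinearSVC


def evaluate_encodings(features, labels, n_splits=3, test_size=0.2, seed=0):
    """Score one backbone's encodings with a linear SVC over stratified splits.

    `features` is an (n_images, dim) array of fixed-length encodings and
    `labels` an (n_images,) array of land cover classes. The split ratio and
    macro averaging are assumptions; the abstract only states "three random
    stratified splits" and "F1-score".
    """
    splitter = StratifiedShuffleSplit(
        n_splits=n_splits, test_size=test_size, random_state=seed
    )
    accuracies, f1_scores, train_times = [], [], []
    for train_idx, test_idx in splitter.split(features, labels):
        clf = LinearSVC(max_iter=5000)

        # Time only the classifier fit, as a proxy for training cost.
        start = time.perf_counter()
        clf.fit(features[train_idx], labels[train_idx])
        train_times.append(time.perf_counter() - start)

        predictions = clf.predict(features[test_idx])
        accuracies.append(accuracy_score(labels[test_idx], predictions))
        f1_scores.append(f1_score(labels[test_idx], predictions, average="macro"))

    # Average each metric across the splits.
    return {
        "accuracy": float(np.mean(accuracies)),
        "macro_f1": float(np.mean(f1_scores)),
        "train_time_s": float(np.mean(train_times)),
    }


# Synthetic stand-in for encodings from one backbone (hypothetical data).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=64, n_informative=16, n_classes=4, random_state=0
)
print(evaluate_encodings(X_demo, y_demo))
```

Running the same function on the encodings from each candidate backbone, with the same splits, would give one row per backbone in a ranking; because the classifier is fixed and linear, differences between rows can be attributed mainly to the encodings.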


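Source journal

Expert Systems (Engineering and Technology, Computer Science: Theory and Methods)

CiteScore: 7.40

Self-citation rate: 6.10%

Articles published: 266

Review time: 24 months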

About the journal: Expert Systems: The Journal of Knowledge Engineering publishes papers dealing with all aspects of knowledge engineering, including individual methods and techniques in knowledge acquisition and representation, and their application in the construction of systems – including expert systems – based thereon. Detailed scientific evaluation is an essential part of any paper. As well as traditional application areas, such as Software and Requirements Engineering, Human-Computer Interaction, and Artificial Intelligence, we are aiming at the new and growing markets for these technologies, such as Business, Economy, Market Research, and Medical and Health Care. The shift towards this new focus will be marked by a series of special issues covering hot and emergent topics.