Jifa Chen , Gang Chen , Pin Zhou , Yufeng He , Lianzhe Yue , Mingjun Ding , Hui Lin
DOI: 10.1016/j.jag.2025.104546
Journal: International Journal of Applied Earth Observation and Geoinformation (ITC Journal), Volume 139, Article 104546
Publication date: 2025-04-28 (Journal Article; JCR Q1, Remote Sensing; Impact Factor 7.6)
URL: https://www.sciencedirect.com/science/article/pii/S1569843225001931
CTSeg: CNN and ViT collaborated segmentation framework for efficient land-use/land-cover mapping with high-resolution remote sensing images
Semantic segmentation models play a significant role in land-use/land-cover (LULC) mapping. Although vision transformers (ViT) with long-sequence interactions have recently emerged as popular alternatives to convolutional neural networks (CNN), they remain less effective for high-resolution remote sensing data characterized by small volumes and rich heterogeneity. In this paper, we propose a novel CNN and ViT collaborated segmentation framework (CTSeg) to address these weaknesses. Following the encoder-decoder architecture, we first introduce an encoding backbone with multifarious attention mechanisms to capture global and local contexts, respectively. It is designed with parallel dual branches: position-relation aggregation (PRA) blocks and channel-relation aggregation (CRA) blocks form the CNN-based encoding module, whereas the ViT-based module comprises multi-stage window-shifted transformer (WST) blocks with cross-window interactions. We further explore an online knowledge distillation scheme, with pixel-wise and channel-wise feature distillation modules, to facilitate bidirectional learning between the CNN and ViT backbones, supported by a well-designed loss decay strategy. In addition, we develop a multiscale feature decoding module to produce higher-quality segmentation predictions, in which correlation-weighted fusions emphasize the heterogeneous feature representations. Extensive comparison and ablation studies on two benchmark datasets demonstrate its competitive performance in efficient LULC mapping.
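The abstract does not spell out the exact form of the pixel-wise distillation loss or the loss decay strategy. As an illustration only, the sketch below shows one common way such a term is built: a per-pixel KL divergence between the teacher's and student's class distributions (here computed in pure Python over logit lists), paired with a hypothetical linear decay schedule for the distillation weight. The function names and the linear schedule are assumptions, not the paper's implementation.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one pixel's class logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def pixelwise_distill_loss(student_logits, teacher_logits):
    """Mean per-pixel KL(teacher || student).

    Each argument is a list of per-pixel class-logit lists; the teacher
    distribution is the target, so the loss is zero when the two
    branches agree everywhere.
    """
    total = 0.0
    for s_l, t_l in zip(student_logits, teacher_logits):
        p = softmax(t_l)  # teacher (target) distribution
        q = softmax(s_l)  # student distribution
        total += sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return total / len(student_logits)

def decayed_weight(w0, epoch, total_epochs):
    """Hypothetical linear loss-decay schedule for the distillation term."""
    return w0 * (1.0 - epoch / total_epochs)
```

In a bidirectional setup like CTSeg's, each branch would act as teacher for the other, with the decayed weight gradually shifting emphasis from distillation toward the task loss as training progresses.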
Journal introduction:
The International Journal of Applied Earth Observation and Geoinformation publishes original papers that utilize earth observation data for natural resource and environmental inventory and management. These data primarily originate from remote sensing platforms, including satellites and aircraft, supplemented by surface and subsurface measurements. Addressing natural resources such as forests, agricultural land, soils, and water, as well as environmental concerns like biodiversity, land degradation, and hazards, the journal explores conceptual and data-driven approaches. It covers geoinformation themes like capturing, databasing, visualization, interpretation, data quality, and spatial uncertainty.