{"title":"基于跨模态层次特征融合的航空图像语义分割方法","authors":"Jinglei Bai;Jinfu Yang;Tao Xiang;Shu Cai","doi":"10.1109/LGRS.2025.3602267","DOIUrl":null,"url":null,"abstract":"Multimodal aerial image semantic segmentation enables fine-grained land cover classification by integrating data from different sensors, yet it remains challenged by information redundancy, intermodal feature discrepancies, and class confusion in complex scenes. To address these issues, we propose a cross-modal hierarchical feature fusion network (CMHFNet) based on an encoder–decoder architecture. The encoder incorporates a pixelwise attention-guided fusion module (PAFM) and a multistage progressive fusion transformer (MPFT) to suppress redundancy and model long-range intermodal dependencies and scale variations. The decoder introduces a residual information-guided feature compensation mechanism to recover spatial details and mitigate class ambiguity. The experiments on DDOS, Vaihingen, and Potsdam datasets demonstrate that the CMHFNet surpasses state-of-the-art methods, validating its effectiveness and practical value.","PeriodicalId":91017,"journal":{"name":"IEEE geoscience and remote sensing letters : a publication of the IEEE Geoscience and Remote Sensing Society","volume":"22 ","pages":"1-5"},"PeriodicalIF":4.4000,"publicationDate":"2025-08-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Aerial Image Semantic Segmentation Method Based on Cross-Modal Hierarchical Feature Fusion\",\"authors\":\"Jinglei Bai;Jinfu Yang;Tao Xiang;Shu Cai\",\"doi\":\"10.1109/LGRS.2025.3602267\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Multimodal aerial image semantic segmentation enables fine-grained land cover classification by integrating data from different sensors, yet it remains challenged by information redundancy, intermodal feature discrepancies, and class confusion in complex scenes. To address these issues, we propose a cross-modal hierarchical feature fusion network (CMHFNet) based on an encoder–decoder architecture. The encoder incorporates a pixelwise attention-guided fusion module (PAFM) and a multistage progressive fusion transformer (MPFT) to suppress redundancy and model long-range intermodal dependencies and scale variations. The decoder introduces a residual information-guided feature compensation mechanism to recover spatial details and mitigate class ambiguity. 
The experiments on DDOS, Vaihingen, and Potsdam datasets demonstrate that the CMHFNet surpasses state-of-the-art methods, validating its effectiveness and practical value.\",\"PeriodicalId\":91017,\"journal\":{\"name\":\"IEEE geoscience and remote sensing letters : a publication of the IEEE Geoscience and Remote Sensing Society\",\"volume\":\"22 \",\"pages\":\"1-5\"},\"PeriodicalIF\":4.4000,\"publicationDate\":\"2025-08-25\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE geoscience and remote sensing letters : a publication of the IEEE Geoscience and Remote Sensing Society\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/11137359/\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE geoscience and remote sensing letters : a publication of the IEEE Geoscience and Remote Sensing Society","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/11137359/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Aerial Image Semantic Segmentation Method Based on Cross-Modal Hierarchical Feature Fusion
Multimodal aerial image semantic segmentation enables fine-grained land-cover classification by integrating data from different sensors, yet it remains challenged by information redundancy, intermodal feature discrepancies, and class confusion in complex scenes. To address these issues, we propose a cross-modal hierarchical feature fusion network (CMHFNet) built on an encoder–decoder architecture. The encoder incorporates a pixelwise attention-guided fusion module (PAFM) and a multistage progressive fusion transformer (MPFT) to suppress redundant information and to model long-range intermodal dependencies and scale variations. The decoder introduces a residual information-guided feature compensation mechanism that recovers spatial details and mitigates class ambiguity. Experiments on the DDOS, Vaihingen, and Potsdam datasets demonstrate that CMHFNet surpasses state-of-the-art methods, validating its effectiveness and practical value.
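The abstract gives no implementation details, but the pixelwise attention-guided fusion idea can be illustrated with a minimal sketch. The PyTorch module below is an assumption-laden illustration, not the authors' PAFM: it predicts a per-pixel, per-channel gate from two same-shape modality feature maps (e.g., RGB and DSM encoder features) and blends them, which is one common way to suppress cross-modal redundancy at the pixel level. All module and variable names are hypothetical.

```python
# Hypothetical sketch of pixelwise attention-guided fusion between two
# modality feature maps (e.g., optical RGB and DSM/depth). This is NOT
# the paper's PAFM; it only illustrates the general mechanism.
import torch
import torch.nn as nn


class PixelwiseAttentionFusion(nn.Module):
    """Fuse two same-shape feature maps with per-pixel attention gates."""

    def __init__(self, channels: int):
        super().__init__()
        # Predict a per-pixel, per-channel gate from the concatenated inputs.
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, f_rgb: torch.Tensor, f_aux: torch.Tensor) -> torch.Tensor:
        # alpha in (0, 1) decides, per pixel and channel, how much of each
        # modality to keep, suppressing redundant or noisy responses.
        alpha = self.gate(torch.cat([f_rgb, f_aux], dim=1))
        return alpha * f_rgb + (1.0 - alpha) * f_aux


if __name__ == "__main__":
    fusion = PixelwiseAttentionFusion(channels=64)
    rgb = torch.randn(2, 64, 32, 32)   # stage features from the RGB branch
    aux = torch.randn(2, 64, 32, 32)   # stage features from the auxiliary branch
    print(fusion(rgb, aux).shape)      # torch.Size([2, 64, 32, 32])
```

In such a gated design the fused map stays at the input resolution and channel count, so it can drop into each encoder stage before deeper fusion (e.g., a transformer stage modeling long-range intermodal dependencies, as the MPFT is described as doing).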