CSFAFormer: Category-selective feature aggregation transformer for multimodal remote sensing image semantic segmentation
Yue Ni, Donglin Xue, Weijian Chi, Ji Luan, Jiahang Liu
{"title":"CSFAFormer:用于多模态遥感图像语义分割的分类选择性特征聚合转换器","authors":"Yue Ni , Donglin Xue , Weijian Chi , Ji Luan , Jiahang Liu","doi":"10.1016/j.inffus.2025.103786","DOIUrl":null,"url":null,"abstract":"<div><div>Feature fusion is one of the keys to multimodal data segmentation. Different fusion mechanisms vary significantly in how effectively they utilize inter-modal features, exploit complementary information, and enhance representations, while also greatly affecting model parameters and computational complexity. Cross-attention fusion mechanism (CAFM) is the most widely used feature fusion mechanism in the current multimodal fusion classification task, but due to the inherent limitation, it cannot adapt to the differentiated feature requirements of different classes and leads to the blurring of interclass and dispersal features of intraclass. To address these challenges, a novel Category-Selective Feature Aggregation Transformer (CSFAFormer) is proposed to dynamically adjust the interaction weights between modalities along the class dimension, thereby fully leveraging the complementary advantages of different modalities. To accommodate the differentiated needs of different categories, a Category Cross-Calibration Mechanism (C<sup>3</sup>M) is designed to compress multi-channel features, estimate pixel-level class distributions, and employ a confidence-based cross-calibration strategy to dynamically adjust interaction weights along the class dimension, better accommodating the varying demands of different classes. To further semantic consistency and inter-class separability, a Category-Selective Transformer Module is proposed to leverage the class information calibrated by C<sup>3</sup>M for adaptive weighted fusion along the class dimension, thereby optimizing the representation of category-specific features. Experimental results indicate that CSFAFormer significantly outperforms in segmentation performance. Compared to the CAFM, CSFAFormer reduces the parameter count by 38.5 % and the computational cost by 72.3 %, while maintaining superior performance. The code is available at: <span><span>https://github.com/NUAALISILab/CSFAFormer</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50367,"journal":{"name":"Information Fusion","volume":"127 ","pages":"Article 103786"},"PeriodicalIF":15.5000,"publicationDate":"2025-09-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"CSFAFormer: Category-selective feature aggregation transformer for multimodal remote sensing image semantic segmentation\",\"authors\":\"Yue Ni , Donglin Xue , Weijian Chi , Ji Luan , Jiahang Liu\",\"doi\":\"10.1016/j.inffus.2025.103786\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Feature fusion is one of the keys to multimodal data segmentation. Different fusion mechanisms vary significantly in how effectively they utilize inter-modal features, exploit complementary information, and enhance representations, while also greatly affecting model parameters and computational complexity. Cross-attention fusion mechanism (CAFM) is the most widely used feature fusion mechanism in the current multimodal fusion classification task, but due to the inherent limitation, it cannot adapt to the differentiated feature requirements of different classes and leads to the blurring of interclass and dispersal features of intraclass. 
To address these challenges, a novel Category-Selective Feature Aggregation Transformer (CSFAFormer) is proposed to dynamically adjust the interaction weights between modalities along the class dimension, thereby fully leveraging the complementary advantages of different modalities. To accommodate the differentiated needs of different categories, a Category Cross-Calibration Mechanism (C<sup>3</sup>M) is designed to compress multi-channel features, estimate pixel-level class distributions, and employ a confidence-based cross-calibration strategy to dynamically adjust interaction weights along the class dimension, better accommodating the varying demands of different classes. To further semantic consistency and inter-class separability, a Category-Selective Transformer Module is proposed to leverage the class information calibrated by C<sup>3</sup>M for adaptive weighted fusion along the class dimension, thereby optimizing the representation of category-specific features. Experimental results indicate that CSFAFormer significantly outperforms in segmentation performance. Compared to the CAFM, CSFAFormer reduces the parameter count by 38.5 % and the computational cost by 72.3 %, while maintaining superior performance. The code is available at: <span><span>https://github.com/NUAALISILab/CSFAFormer</span><svg><path></path></svg></span>.</div></div>\",\"PeriodicalId\":50367,\"journal\":{\"name\":\"Information Fusion\",\"volume\":\"127 \",\"pages\":\"Article 103786\"},\"PeriodicalIF\":15.5000,\"publicationDate\":\"2025-09-25\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Information Fusion\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S1566253525008486\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Information Fusion","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1566253525008486","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Feature fusion is one of the keys to multimodal data segmentation. Fusion mechanisms differ significantly in how effectively they utilize inter-modal features, exploit complementary information, and enhance representations, and they also strongly affect model size and computational complexity. The cross-attention fusion mechanism (CAFM) is the most widely used feature fusion mechanism in current multimodal fusion classification tasks, but an inherent limitation prevents it from adapting to the differentiated feature requirements of different classes, leading to blurred inter-class boundaries and dispersed intra-class features. To address these challenges, a novel Category-Selective Feature Aggregation Transformer (CSFAFormer) is proposed to dynamically adjust the interaction weights between modalities along the class dimension, thereby fully leveraging the complementary advantages of different modalities. To accommodate the differentiated needs of different categories, a Category Cross-Calibration Mechanism (C³M) is designed to compress multi-channel features, estimate pixel-level class distributions, and employ a confidence-based cross-calibration strategy that dynamically adjusts interaction weights along the class dimension. To further improve semantic consistency and inter-class separability, a Category-Selective Transformer Module is proposed that leverages the class information calibrated by C³M for adaptive weighted fusion along the class dimension, thereby optimizing the representation of category-specific features. Experimental results indicate that CSFAFormer significantly outperforms existing methods in segmentation performance. Compared to CAFM, CSFAFormer reduces the parameter count by 38.5% and the computational cost by 72.3% while maintaining superior performance. The code is available at: https://github.com/NUAALISILab/CSFAFormer.
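The abstract describes C³M only at a high level: compress multi-channel features into pixel-level class distributions, then cross-calibrate the two modalities' interaction weights by confidence along the class dimension. As a rough illustration of that idea, the following minimal PyTorch sketch shows one plausible realization. It is not the authors' implementation (see the linked repository for that); the 1x1-convolution compression and the max-probability confidence heuristic are assumptions made purely for illustration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CategoryCrossCalibrationSketch(nn.Module):
    """Hypothetical sketch of confidence-based, class-wise cross-calibration.
    NOT the authors' C3M; the reference implementation is at
    https://github.com/NUAALISILab/CSFAFormer."""

    def __init__(self, channels: int, num_classes: int):
        super().__init__()
        # Compress multi-channel features to per-pixel class logits
        # (a 1x1 convolution is assumed here for illustration).
        self.to_classes_a = nn.Conv2d(channels, num_classes, kernel_size=1)
        self.to_classes_b = nn.Conv2d(channels, num_classes, kernel_size=1)

    def forward(self, feat_a, feat_b):
        # Pixel-level class distributions for each modality: (B, K, H, W).
        dist_a = F.softmax(self.to_classes_a(feat_a), dim=1)
        dist_b = F.softmax(self.to_classes_b(feat_b), dim=1)
        # Assumed confidence heuristic: spatial maximum of each class map,
        # giving one confidence weight per class: (B, K, 1, 1).
        conf_a = dist_a.amax(dim=(2, 3), keepdim=True)
        conf_b = dist_b.amax(dim=(2, 3), keepdim=True)
        # Cross-calibration: each modality's class maps are rescaled by the
        # OTHER modality's per-class confidence, so classes one modality is
        # unsure about lean on the complementary modality.
        return dist_a * conf_b, dist_b * conf_a

# Usage on dummy inputs (e.g. optical and SAR/DSM feature maps):
module = CategoryCrossCalibrationSketch(channels=64, num_classes=6)
a, b = torch.randn(2, 64, 32, 32), torch.randn(2, 64, 32, 32)
cal_a, cal_b = module(a, b)
print(cal_a.shape)  # torch.Size([2, 6, 32, 32])

The calibrated class maps could then drive a class-dimension weighted fusion such as the Category-Selective Transformer Module; how CSFAFormer actually realizes this is specified in the paper and repository, not in this sketch.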
Journal introduction:
Information Fusion serves as a central platform for showcasing advancements in multi-sensor, multi-source, multi-process information fusion, fostering collaboration among the diverse disciplines driving its progress. It is the leading outlet for sharing research and development in this field, focusing on architectures, algorithms, and applications. Papers presenting fundamental theoretical analyses, as well as those demonstrating their application to real-world problems, are welcome.