CSFAFormer: Category-selective feature aggregation transformer for multimodal remote sensing image semantic segmentation
Yue Ni, Donglin Xue, Weijian Chi, Ji Luan, Jiahang Liu
{"title":"CSFAFormer:用于多模态遥感图像语义分割的分类选择性特征聚合转换器","authors":"Yue Ni , Donglin Xue , Weijian Chi , Ji Luan , Jiahang Liu","doi":"10.1016/j.inffus.2025.103786","DOIUrl":null,"url":null,"abstract":"<div><div>Feature fusion is one of the keys to multimodal data segmentation. Different fusion mechanisms vary significantly in how effectively they utilize inter-modal features, exploit complementary information, and enhance representations, while also greatly affecting model parameters and computational complexity. Cross-attention fusion mechanism (CAFM) is the most widely used feature fusion mechanism in the current multimodal fusion classification task, but due to the inherent limitation, it cannot adapt to the differentiated feature requirements of different classes and leads to the blurring of interclass and dispersal features of intraclass. To address these challenges, a novel Category-Selective Feature Aggregation Transformer (CSFAFormer) is proposed to dynamically adjust the interaction weights between modalities along the class dimension, thereby fully leveraging the complementary advantages of different modalities. To accommodate the differentiated needs of different categories, a Category Cross-Calibration Mechanism (C<sup>3</sup>M) is designed to compress multi-channel features, estimate pixel-level class distributions, and employ a confidence-based cross-calibration strategy to dynamically adjust interaction weights along the class dimension, better accommodating the varying demands of different classes. To further semantic consistency and inter-class separability, a Category-Selective Transformer Module is proposed to leverage the class information calibrated by C<sup>3</sup>M for adaptive weighted fusion along the class dimension, thereby optimizing the representation of category-specific features. Experimental results indicate that CSFAFormer significantly outperforms in segmentation performance. Compared to the CAFM, CSFAFormer reduces the parameter count by 38.5 % and the computational cost by 72.3 %, while maintaining superior performance. The code is available at: <span><span>https://github.com/NUAALISILab/CSFAFormer</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50367,"journal":{"name":"Information Fusion","volume":"127 ","pages":"Article 103786"},"PeriodicalIF":15.5000,"publicationDate":"2025-09-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"CSFAFormer: Category-selective feature aggregation transformer for multimodal remote sensing image semantic segmentation\",\"authors\":\"Yue Ni , Donglin Xue , Weijian Chi , Ji Luan , Jiahang Liu\",\"doi\":\"10.1016/j.inffus.2025.103786\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Feature fusion is one of the keys to multimodal data segmentation. Different fusion mechanisms vary significantly in how effectively they utilize inter-modal features, exploit complementary information, and enhance representations, while also greatly affecting model parameters and computational complexity. Cross-attention fusion mechanism (CAFM) is the most widely used feature fusion mechanism in the current multimodal fusion classification task, but due to the inherent limitation, it cannot adapt to the differentiated feature requirements of different classes and leads to the blurring of interclass and dispersal features of intraclass. 
To address these challenges, a novel Category-Selective Feature Aggregation Transformer (CSFAFormer) is proposed to dynamically adjust the interaction weights between modalities along the class dimension, thereby fully leveraging the complementary advantages of different modalities. To accommodate the differentiated needs of different categories, a Category Cross-Calibration Mechanism (C<sup>3</sup>M) is designed to compress multi-channel features, estimate pixel-level class distributions, and employ a confidence-based cross-calibration strategy to dynamically adjust interaction weights along the class dimension, better accommodating the varying demands of different classes. To further semantic consistency and inter-class separability, a Category-Selective Transformer Module is proposed to leverage the class information calibrated by C<sup>3</sup>M for adaptive weighted fusion along the class dimension, thereby optimizing the representation of category-specific features. Experimental results indicate that CSFAFormer significantly outperforms in segmentation performance. Compared to the CAFM, CSFAFormer reduces the parameter count by 38.5 % and the computational cost by 72.3 %, while maintaining superior performance. The code is available at: <span><span>https://github.com/NUAALISILab/CSFAFormer</span><svg><path></path></svg></span>.</div></div>\",\"PeriodicalId\":50367,\"journal\":{\"name\":\"Information Fusion\",\"volume\":\"127 \",\"pages\":\"Article 103786\"},\"PeriodicalIF\":15.5000,\"publicationDate\":\"2025-09-25\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Information Fusion\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S1566253525008486\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Information Fusion","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1566253525008486","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Feature fusion is one of the keys to multimodal data segmentation. Fusion mechanisms differ significantly in how effectively they utilize inter-modal features, exploit complementary information, and enhance representations, and they also strongly affect model size and computational complexity. The cross-attention fusion mechanism (CAFM) is the most widely used feature fusion mechanism in current multimodal fusion classification tasks, but an inherent limitation prevents it from adapting to the differentiated feature requirements of different classes, leading to blurred inter-class boundaries and dispersed intra-class features. To address these challenges, a novel Category-Selective Feature Aggregation Transformer (CSFAFormer) is proposed to dynamically adjust the interaction weights between modalities along the class dimension, thereby fully leveraging the complementary advantages of different modalities. To accommodate the differentiated needs of different categories, a Category Cross-Calibration Mechanism (C³M) is designed to compress multi-channel features, estimate pixel-level class distributions, and employ a confidence-based cross-calibration strategy that dynamically adjusts interaction weights along the class dimension. To further improve semantic consistency and inter-class separability, a Category-Selective Transformer Module is proposed that leverages the class information calibrated by C³M for adaptive weighted fusion along the class dimension, thereby optimizing the representation of category-specific features. Experimental results indicate that CSFAFormer significantly outperforms existing methods in segmentation performance. Compared to CAFM, CSFAFormer reduces the parameter count by 38.5% and the computational cost by 72.3% while maintaining superior performance. The code is available at: https://github.com/NUAALISILab/CSFAFormer.
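The abstract describes C³M only at a high level: compress multi-channel features into pixel-level class distributions, then cross-calibrate the two modalities' interaction weights by confidence along the class dimension. As a rough illustration of that idea, the following minimal PyTorch sketch shows one plausible realization. It is not the authors' implementation (see the linked repository for that); the 1x1-convolution compression and the max-probability confidence heuristic are assumptions made purely for illustration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CategoryCrossCalibrationSketch(nn.Module):
    """Hypothetical sketch of confidence-based, class-wise cross-calibration.
    NOT the authors' C3M; the reference implementation is at
    https://github.com/NUAALISILab/CSFAFormer."""

    def __init__(self, channels: int, num_classes: int):
        super().__init__()
        # Compress multi-channel features to per-pixel class logits
        # (a 1x1 convolution is assumed here for illustration).
        self.to_classes_a = nn.Conv2d(channels, num_classes, kernel_size=1)
        self.to_classes_b = nn.Conv2d(channels, num_classes, kernel_size=1)

    def forward(self, feat_a, feat_b):
        # Pixel-level class distributions for each modality: (B, K, H, W).
        dist_a = F.softmax(self.to_classes_a(feat_a), dim=1)
        dist_b = F.softmax(self.to_classes_b(feat_b), dim=1)
        # Assumed confidence heuristic: spatial maximum of each class map,
        # giving one confidence weight per class: (B, K, 1, 1).
        conf_a = dist_a.amax(dim=(2, 3), keepdim=True)
        conf_b = dist_b.amax(dim=(2, 3), keepdim=True)
        # Cross-calibration: each modality's class maps are rescaled by the
        # OTHER modality's per-class confidence, so classes one modality is
        # unsure about lean on the complementary modality.
        return dist_a * conf_b, dist_b * conf_a

# Usage on dummy inputs (e.g. optical and SAR/DSM feature maps):
module = CategoryCrossCalibrationSketch(channels=64, num_classes=6)
a, b = torch.randn(2, 64, 32, 32), torch.randn(2, 64, 32, 32)
cal_a, cal_b = module(a, b)
print(cal_a.shape)  # torch.Size([2, 6, 32, 32])

The calibrated class maps could then drive a class-dimension weighted fusion such as the Category-Selective Transformer Module; how CSFAFormer actually realizes this is specified in the paper and repository, not in this sketch.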
Journal introduction:
Information Fusion serves as a central platform for showcasing advancements in multi-sensor, multi-source, multi-process information fusion, fostering collaboration among the diverse disciplines driving its progress. It is the leading outlet for sharing research and development in this field, focusing on architectures, algorithms, and applications. Papers presenting fundamental theoretical analyses, as well as those demonstrating their application to real-world problems, are welcome.