{"title":"YOLO-G3CF: Gaussian Contrastive Cross-Channel Fusion for Multimodal Object Detection","authors":"Abdelbadie Belmouhcine;Minh-Tan Pham;Sébastien Lefèvre","doi":"10.1109/LGRS.2025.3564181","DOIUrl":null,"url":null,"abstract":"Object detection is a crucial task in both computer vision and remote sensing. The performance of object detectors can vary across different modalities depending on lighting and weather conditions. To address these challenges, we propose a fusion module based on contrastive learning and Gaussian cross-channel attention, called Gaussian contrastive cross-channel fusion (G3CF). We integrate this module into a dual-you only look once (YOLO) architecture, forming YOLO-G3CF. The contrastive loss enforces similarity between the features sent to the detection head from both modality branches, as they should lead to the same detections. The Gaussian attention mechanism enables the model to fuse features in a higher dimensional space, enhancing discriminative power. Extensive experiments on VEDAI, GeoImageNet, VTUAV-det, and FLIR demonstrate that G3CF improves detection performance, achieving a mAP increase of up to 6.64% over the best single-modality baselines and outperforming prior multimodal fusion methods. Regarding model complexity, our fusion method operates at a late stage, increasing the computational cost of single-modality YOLO by approximately 150% in terms of giga floating-point operations per second (GFLOP). For instance, YOLOv8 requires 52.84 GFLOPs, whereas YOLOv8-G3CF, due to its dual architecture and three G3CF modules, increases this to 131.22 GFLOPs. However, a single G3CF module requires only ~15 GFLOPs. Despite this overhead, our approach remains computationally less expensive than transformer-based models, e.g., ICAFusion requires 284.80 GFLOPs. Moreover, the proposed method still operates in real-time, achieving ~19 FPS on an NVIDIA RTX 2080. The code is available at <uri>https://github.com/abelmouhcine/YOLO-G3CF</uri>.","PeriodicalId":91017,"journal":{"name":"IEEE geoscience and remote sensing letters : a publication of the IEEE Geoscience and Remote Sensing Society","volume":"22 ","pages":"1-5"},"PeriodicalIF":0.0000,"publicationDate":"2025-04-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE geoscience and remote sensing letters : a publication of the IEEE Geoscience and Remote Sensing Society","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/10975811/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Object detection is a crucial task in both computer vision and remote sensing. The performance of object detectors can vary across modalities depending on lighting and weather conditions. To address these challenges, we propose a fusion module based on contrastive learning and Gaussian cross-channel attention, called Gaussian contrastive cross-channel fusion (G3CF). We integrate this module into a dual-branch you only look once (YOLO) architecture, forming YOLO-G3CF. The contrastive loss enforces similarity between the features sent to the detection head from the two modality branches, since both should lead to the same detections. The Gaussian attention mechanism enables the model to fuse features in a higher-dimensional space, enhancing discriminative power. Extensive experiments on VEDAI, GeoImageNet, VTUAV-det, and FLIR demonstrate that G3CF improves detection performance, achieving an mAP increase of up to 6.64% over the best single-modality baselines and outperforming prior multimodal fusion methods. Regarding model complexity, our fusion method operates at a late stage, increasing the computational cost of single-modality YOLO by approximately 150% in terms of giga floating-point operations (GFLOPs). For instance, YOLOv8 requires 52.84 GFLOPs, whereas YOLOv8-G3CF, with its dual architecture and three G3CF modules, requires 131.22 GFLOPs; a single G3CF module accounts for only ~15 GFLOPs. Despite this overhead, our approach remains computationally cheaper than transformer-based models such as ICAFusion, which requires 284.80 GFLOPs. Moreover, the proposed method still runs in real time, achieving ~19 FPS on an NVIDIA RTX 2080. The code is available at https://github.com/abelmouhcine/YOLO-G3CF.
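To make the two ingredients of the abstract concrete, the sketch below pairs a Gaussian (RBF) cross-channel attention fusion with a cosine-similarity contrastive term between the two branch features. This is a minimal illustration, not the authors' implementation: the class and function names (GaussianCrossChannelFusion, contrastive_alignment_loss), the pooling-based channel descriptors, the RBF bandwidth, and the 1x1-conv fusion head are all assumptions; the actual code is in the linked repository.

```python
# Hypothetical sketch of a G3CF-style fusion block in PyTorch.
# Assumed design throughout; the real code lives at
# https://github.com/abelmouhcine/YOLO-G3CF
import torch
import torch.nn as nn
import torch.nn.functional as F


class GaussianCrossChannelFusion(nn.Module):
    """Fuses RGB and IR feature maps via a Gaussian (RBF) cross-channel
    affinity. Assumption: the kernel acts on channel descriptors obtained
    by global average pooling; the RBF lifts channel similarities into a
    higher-dimensional (implicit) space, as the abstract describes."""

    def __init__(self, channels: int, gamma: float = 0.5):
        super().__init__()
        self.gamma = gamma  # RBF bandwidth (assumed hyperparameter)
        self.proj = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, rgb: torch.Tensor, ir: torch.Tensor) -> torch.Tensor:
        # Channel descriptors: (B, C)
        d_rgb = rgb.mean(dim=(2, 3))
        d_ir = ir.mean(dim=(2, 3))
        # Pairwise channel differences: (B, C, C)
        diff = d_rgb.unsqueeze(2) - d_ir.unsqueeze(1)
        # Normalized Gaussian affinities, one softmax per direction.
        attn_ir = torch.softmax(-self.gamma * diff.pow(2), dim=-1)  # RGB -> IR
        attn_rgb = torch.softmax(-self.gamma * diff.pow(2), dim=1)  # IR -> RGB
        # Re-weight each branch's channels by cross-modal affinity.
        ir_att = torch.einsum("bij,bjhw->bihw", attn_ir, ir)
        rgb_att = torch.einsum("bij,bihw->bjhw", attn_rgb, rgb)
        # Late fusion: concatenate attended branches, project back to C.
        return self.proj(torch.cat([rgb_att, ir_att], dim=1))


def contrastive_alignment_loss(f_rgb: torch.Tensor, f_ir: torch.Tensor) -> torch.Tensor:
    """Pulls together the features sent to the detection head from the two
    modality branches, since both should yield the same detections.
    Assumed form: 1 - cosine similarity, averaged over spatial positions."""
    f_rgb = F.normalize(f_rgb.flatten(2), dim=1)  # (B, C, H*W)
    f_ir = F.normalize(f_ir.flatten(2), dim=1)
    return (1.0 - (f_rgb * f_ir).sum(dim=1)).mean()


if __name__ == "__main__":
    fuse = GaussianCrossChannelFusion(channels=256)
    rgb = torch.randn(2, 256, 20, 20)  # one YOLO pyramid level per branch
    ir = torch.randn(2, 256, 20, 20)
    fused = fuse(rgb, ir)              # (2, 256, 20, 20), fed to the head
    loss = contrastive_alignment_loss(rgb, ir)
    print(fused.shape, loss.item())
```

In the full model this block would be applied once per pyramid level, which is consistent with the abstract's count of three G3CF modules in YOLOv8-G3CF, and the contrastive term would be added to the detection loss during training.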