Beyond conventional vision: RGB-event fusion for robust object detection in dynamic traffic scenarios

Zhanwen Liu, Yujing Sun, Yang Wang, Nan Yang, Shengbo Eben Li, Xiangmo Zhao

Communications in Transportation Research, Volume 5, Article 100202 (published 2025-08-18). DOI: 10.1016/j.commtr.2025.100202
Citations: 0
Abstract
Conventional RGB cameras suffer from an intrinsic dynamic-range limitation that reduces global contrast and causes the loss of high-frequency details such as textures and edges in complex, dynamic traffic environments (e.g., nighttime driving or tunnel scenes). This deficiency hinders the extraction of discriminative features and degrades the performance of frame-based traffic object detection. To address this problem, we pair a bio-inspired event camera with an RGB camera to supply complementary high-dynamic-range information, and propose the motion cue fusion network (MCFNet), a fusion network that achieves spatiotemporal alignment and adaptively fuses cross-modal features to overcome performance degradation under challenging lighting conditions. Specifically, we design an event correction module (ECM) that temporally aligns asynchronous event streams with their corresponding image frames through optical-flow-based warping. The ECM is jointly optimized with the downstream object detection network to learn task-aware event representations. Subsequently, the event dynamic upsampling module (EDUM) enhances the spatial resolution of event frames to align their distribution with the structure of the image pixels, achieving precise spatiotemporal alignment. Finally, the cross-modal Mamba fusion module (CMM) performs adaptive feature fusion through a novel cross-modal interlaced scanning mechanism, effectively integrating complementary information for robust detection. Experiments on the DSEC-Det and PKU-DAVIS-SOD datasets demonstrate that MCFNet significantly outperforms existing methods across diverse low-light and fast-motion traffic scenarios. Notably, on the DSEC-Det dataset, MCFNet surpasses the best existing method by 7.4% in mAP50 and 1.7% in mAP.
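The temporal-alignment idea behind the ECM can be pictured with a minimal sketch: an event representation (e.g., a voxel grid) accumulated near a frame is resampled at flow-displaced coordinates so that its contents line up with the frame's timestamp. This is not the authors' code; the voxel-grid input, tensor shapes, and bilinear-warp formulation are illustrative assumptions.

```python
# Minimal sketch (assumed shapes, not the authors' implementation) of
# optical-flow-based event warping, the core idea of the ECM.
import torch
import torch.nn.functional as F

def warp_events_to_frame(event_voxel: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """event_voxel: (B, C, H, W) event representation.
    flow: (B, 2, H, W) per-pixel (x, y) offsets from the event
    timestamp to the frame timestamp."""
    b, _, h, w = event_voxel.shape
    # Base pixel-coordinate grid.
    ys, xs = torch.meshgrid(
        torch.arange(h, device=event_voxel.device),
        torch.arange(w, device=event_voxel.device),
        indexing="ij",
    )
    base = torch.stack((xs, ys), dim=0).float()   # (2, H, W)
    coords = base.unsqueeze(0) + flow             # displaced coordinates, (B, 2, H, W)
    # Normalize to [-1, 1] as required by grid_sample.
    gx = 2.0 * coords[:, 0] / (w - 1) - 1.0
    gy = 2.0 * coords[:, 1] / (h - 1) - 1.0
    grid = torch.stack((gx, gy), dim=-1)          # (B, H, W, 2)
    return F.grid_sample(event_voxel, grid, mode="bilinear", align_corners=True)
```

In the paper the flow field is learned and jointly optimized with the detector to yield task-aware event representations; here it is simply taken as an input.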
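Similarly, the cross-modal interlaced scanning in the CMM can be read as interleaving flattened image and event tokens into a single sequence before a Mamba-style sequence model consumes them. The token-by-token alternation below is an assumption about what "interlaced" means; the actual scanning order in MCFNet may differ.

```python
# Hedged sketch of a cross-modal interlaced scan: image and event feature
# maps are flattened to token sequences and interleaved token-by-token
# ([img_0, evt_0, img_1, evt_1, ...]) so a downstream sequence model sees
# alternating modalities. The exact ordering used by the CMM is assumed.
import torch

def interlaced_scan(img_feat: torch.Tensor, evt_feat: torch.Tensor) -> torch.Tensor:
    """img_feat, evt_feat: (B, C, H, W) -> (B, 2*H*W, C) interleaved tokens."""
    b, c, h, w = img_feat.shape
    img_tokens = img_feat.flatten(2).transpose(1, 2)        # (B, H*W, C)
    evt_tokens = evt_feat.flatten(2).transpose(1, 2)        # (B, H*W, C)
    paired = torch.stack((img_tokens, evt_tokens), dim=2)   # (B, H*W, 2, C)
    return paired.reshape(b, 2 * h * w, c)
```

The interleaved sequence would then be fed to a selective state-space (Mamba) block, whose linear-time scan lets each token condition on the other modality's immediately preceding token, which is one plausible way such a mechanism integrates complementary RGB and event cues.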