Beyond conventional vision: RGB-event fusion for robust object detection in dynamic traffic scenarios

Zhanwen Liu, Yujing Sun, Yang Wang, Nan Yang, Shengbo Eben Li, Xiangmo Zhao

Communications in Transportation Research, Volume 5, Article 100202 (published 2025-08-18). DOI: 10.1016/j.commtr.2025.100202
Citations: 0
Abstract
Conventional RGB cameras suffer from an intrinsic dynamic-range limitation that reduces global contrast and causes the loss of high-frequency details such as textures and edges in complex, dynamic traffic environments (e.g., nighttime driving or tunnel scenes). This deficiency hinders the extraction of discriminative features and degrades the performance of frame-based traffic object detection. To address this problem, we pair a bio-inspired event camera with an RGB camera to supply complementary high-dynamic-range information, and propose the motion cue fusion network (MCFNet), a fusion network that achieves spatiotemporal alignment and adaptively fuses cross-modal features to overcome performance degradation under challenging lighting conditions. Specifically, we design an event correction module (ECM) that temporally aligns asynchronous event streams with their corresponding image frames through optical-flow-based warping. The ECM is jointly optimized with the downstream object detection network to learn task-aware event representations. Subsequently, the event dynamic upsampling module (EDUM) enhances the spatial resolution of event frames to align their distribution with the structure of the image pixels, achieving precise spatiotemporal alignment. Finally, the cross-modal Mamba fusion module (CMM) performs adaptive feature fusion through a novel cross-modal interlaced scanning mechanism, effectively integrating complementary information for robust detection. Experiments on the DSEC-Det and PKU-DAVIS-SOD datasets demonstrate that MCFNet significantly outperforms existing methods across diverse low-light and fast-motion traffic scenarios. Notably, on the DSEC-Det dataset, MCFNet surpasses the best existing method by 7.4% in mAP50 and 1.7% in mAP.
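The temporal-alignment idea behind the ECM can be pictured with a minimal sketch: an event representation (e.g., a voxel grid) accumulated near a frame is resampled at flow-displaced coordinates so that its contents line up with the frame's timestamp. This is not the authors' code; the voxel-grid input, tensor shapes, and bilinear-warp formulation are illustrative assumptions.

```python
# Minimal sketch (assumed shapes, not the authors' implementation) of
# optical-flow-based event warping, the core idea of the ECM.
import torch
import torch.nn.functional as F

def warp_events_to_frame(event_voxel: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """event_voxel: (B, C, H, W) event representation.
    flow: (B, 2, H, W) per-pixel (x, y) offsets from the event
    timestamp to the frame timestamp."""
    b, _, h, w = event_voxel.shape
    # Base pixel-coordinate grid.
    ys, xs = torch.meshgrid(
        torch.arange(h, device=event_voxel.device),
        torch.arange(w, device=event_voxel.device),
        indexing="ij",
    )
    base = torch.stack((xs, ys), dim=0).float()   # (2, H, W)
    coords = base.unsqueeze(0) + flow             # displaced coordinates, (B, 2, H, W)
    # Normalize to [-1, 1] as required by grid_sample.
    gx = 2.0 * coords[:, 0] / (w - 1) - 1.0
    gy = 2.0 * coords[:, 1] / (h - 1) - 1.0
    grid = torch.stack((gx, gy), dim=-1)          # (B, H, W, 2)
    return F.grid_sample(event_voxel, grid, mode="bilinear", align_corners=True)
```

In the paper the flow field is learned and jointly optimized with the detector to yield task-aware event representations; here it is simply taken as an input.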
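Similarly, the cross-modal interlaced scanning in the CMM can be read as interleaving flattened image and event tokens into a single sequence before a Mamba-style sequence model consumes them. The token-by-token alternation below is an assumption about what "interlaced" means; the actual scanning order in MCFNet may differ.

```python
# Hedged sketch of a cross-modal interlaced scan: image and event feature
# maps are flattened to token sequences and interleaved token-by-token
# ([img_0, evt_0, img_1, evt_1, ...]) so a downstream sequence model sees
# alternating modalities. The exact ordering used by the CMM is assumed.
import torch

def interlaced_scan(img_feat: torch.Tensor, evt_feat: torch.Tensor) -> torch.Tensor:
    """img_feat, evt_feat: (B, C, H, W) -> (B, 2*H*W, C) interleaved tokens."""
    b, c, h, w = img_feat.shape
    img_tokens = img_feat.flatten(2).transpose(1, 2)        # (B, H*W, C)
    evt_tokens = evt_feat.flatten(2).transpose(1, 2)        # (B, H*W, C)
    paired = torch.stack((img_tokens, evt_tokens), dim=2)   # (B, H*W, 2, C)
    return paired.reshape(b, 2 * h * w, c)
```

The interleaved sequence would then be fed to a selective state-space (Mamba) block, whose linear-time scan lets each token condition on the other modality's immediately preceding token, which is one plausible way such a mechanism integrates complementary RGB and event cues.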