High performance RGB-Thermal Video Object Detection via hybrid fusion with progressive interaction and temporal-modal difference

IF 14.7 | Zone 1, Computer Science | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Information Fusion Pub Date : 2024-09-12 DOI:10.1016/j.inffus.2024.102665

Qishun Wang, Zhengzheng Tu, Chenglong Li, Jin Tang

Information Fusion, Volume 114, Article 102665. https://www.sciencedirect.com/science/article/pii/S1566253524004433

Citations: 0

Abstract

RGB-Thermal Video Object Detection (RGBT VOD) aims to localize and classify predefined objects in visible and thermal spectrum videos. The key issue in RGBT VOD is integrating multi-modal information effectively to improve detection performance. Current multi-modal fusion methods predominantly employ middle fusion strategies, but the inherent difference between modalities directly affects the quality of the fusion. Although the early fusion strategy reduces the modality gap in the middle stage of the network, achieving in-depth feature interaction between different modalities remains challenging. In this work, we propose a novel hybrid fusion network called PTMNet, which combines an early fusion strategy based on progressive interaction with a middle fusion strategy based on the temporal-modal difference, for high-performance RGBT VOD. In particular, we take each modality in turn as the master modality and fuse it early with the other modalities, which serve as auxiliary information, through progressive interaction. This design not only alleviates the modality gap but also facilitates middle fusion. The temporal-modal difference models temporal information through spatial offsets and uses feature erasure between modalities to encourage the network to focus on objects shared by both modalities. The hybrid fusion achieves high detection accuracy using only three input frames, which gives PTMNet a high inference speed. Experimental results show that our approach achieves state-of-the-art performance on the VT-VOD50 dataset while operating at over 70 FPS. The code will be freely released at https://github.com/tzz-ahu for academic purposes.
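The abstract does not give implementation details, so the following is only a toy sketch of the general idea of hybrid fusion, not PTMNet itself. It assumes 1-D lists of floats in place of feature maps, a fixed mixing weight in place of learned progressive interaction, plain frame differencing in place of the spatial-offset temporal modeling the abstract mentions, and a simple threshold in place of learned feature erasure. Every function name, parameter, and constant below is hypothetical.

```python
# Toy illustration of "early fusion + middle fusion" on three input frames.
# NOT the authors' PTMNet: all names, weights, and thresholds are assumptions.

def early_fuse(master, auxiliary, weight=0.5):
    """Early stage: keep one modality as master, mix in the other as auxiliary."""
    return [m + weight * a for m, a in zip(master, auxiliary)]


def temporal_difference(prev_frame, next_frame):
    """Crude temporal cue: absolute change between the neighbouring frames.
    (Stand-in for spatial-offset modeling, which is not reproduced here.)"""
    return [abs(n - p) for p, n in zip(prev_frame, next_frame)]


def erase_unshared(features, other, threshold=0.5):
    """Zero out positions where the other modality responds weakly,
    so only responses present in both modalities survive."""
    return [f if o >= threshold else 0.0 for f, o in zip(features, other)]


def hybrid_fuse(rgb_frames, thermal_frames):
    """Each argument is a (previous, key, next) triple of equal-length vectors."""
    rgb_prev, rgb_key, rgb_next = rgb_frames
    th_prev, th_key, th_next = thermal_frames

    # Early fusion: each modality takes a turn as the master modality.
    rgb_master = early_fuse(rgb_key, th_key)
    th_master = early_fuse(th_key, rgb_key)

    # Middle stage: per-modality temporal cue plus cross-modal erasure.
    rgb_motion = temporal_difference(rgb_prev, rgb_next)
    th_motion = temporal_difference(th_prev, th_next)
    rgb_kept = erase_unshared(rgb_master, th_master)
    th_kept = erase_unshared(th_master, rgb_master)

    # Final fused vector: sum of the surviving features and temporal cues.
    return [
        r + t + mr + mt
        for r, t, mr, mt in zip(rgb_kept, th_kept, rgb_motion, th_motion)
    ]
```

Calling `hybrid_fuse` with an RGB triple whose key frame has a weak response at a position where the thermal key frame has none leaves that position at zero in the output, while a position that is strong in both modalities is kept. This mirrors, in a very simplified way, the stated goal of focusing on objects shared by both modalities.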


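Source journal

Information Fusion (Engineering and Technology; Computer Science: Theory and Methods)

CiteScore: 33.20

Self-citation rate: 4.30%

Articles published: 161

Review time: 7.9 months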

Journal description: Information Fusion serves as a central platform for showcasing advancements in multi-sensor, multi-source, multi-process information fusion, fostering collaboration among the diverse disciplines driving its progress. It is the leading outlet for sharing research and development in this field, focusing on architectures, algorithms, and applications. Papers dealing with fundamental theoretical analyses, as well as those demonstrating their application to real-world problems, are welcome.