Joint Transformer and Mamba fusion for multispectral object detection
Authors: Chao Li, Xiaoming Peng
Journal: Image and Vision Computing, Volume 156, Article 105468
Published: 2025-02-27
DOI: 10.1016/j.imavis.2025.105468
URL: https://www.sciencedirect.com/science/article/pii/S0262885625000563
Impact factor: 4.2 (JCR Q2, Computer Science, Artificial Intelligence)
Citations: 0
Abstract
Multispectral object detection is generally considered to outperform single-modality object detection, because the images in a multispectral pair carry complementary information. However, how best to integrate features from images of different modalities remains an open problem. In this paper, we propose a new multispectral object detection framework based on the Transformer and Mamba architectures, called joint Transformer and Mamba detection (JTMDet). Specifically, we divide the feature fusion process into two stages, an intra-scale fusion stage and an inter-scale fusion stage, to make full use of the multi-modal features at different scales. To this end, we design cross-modal fusion (CMF) and cross-level fusion (CLF) modules, both of which contain JTMBlock modules. A JTMBlock module interleaves Transformer and Mamba layers to capture the useful information in multispectral image pairs while maintaining high inference speed. Extensive experiments on three publicly available datasets show that the proposed JTMDet framework achieves state-of-the-art multispectral object detection performance and is competitive with current leading methods. Code and pre-trained models are publicly available at https://github.com/LiC2023/JTMDet.
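The two-stage data flow described in the abstract can be illustrated with a toy sketch in plain Python. This is not the authors' implementation (that is in the linked repository). Every function name below is hypothetical, feature maps are reduced to short lists of scalar tokens, the "Mamba" layer is replaced by a plain linear recurrence without input-dependent selectivity, and the mapping of CMF to the intra-scale stage and CLF to the inter-scale stage is an assumption based on the module names.

```python
import math


def attention_layer(tokens):
    """Toy self-attention on scalar tokens: each token attends to all tokens (O(n^2))."""
    out = []
    for q in tokens:
        scores = [q * k for k in tokens]
        m = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append(sum(w * v for w, v in zip(weights, tokens)) / total)
    return out


def scan_layer(tokens, decay=0.5):
    """Toy linear recurrence h_t = decay * h_{t-1} + (1 - decay) * x_t (O(n)).

    Assumption: a simplified stand-in for a Mamba-style state-space layer.
    """
    out, h = [], 0.0
    for x in tokens:
        h = decay * h + (1.0 - decay) * x
        out.append(h)
    return out


def jtm_block(tokens, depth=2):
    """Hypothetical block: alternate attention and scan layers with residuals."""
    for _ in range(depth):
        tokens = [x + y for x, y in zip(tokens, attention_layer(tokens))]
        tokens = [x + y for x, y in zip(tokens, scan_layer(tokens))]
    return tokens


def cross_modal_fusion(rgb, ir):
    """Intra-scale stage (assumed): mix both modalities at one scale, then merge."""
    n = len(rgb)
    mixed = jtm_block(rgb + ir)  # concatenate the token sequences
    return [a + b for a, b in zip(mixed[:n], mixed[n:])]


def cross_level_fusion(levels):
    """Inter-scale stage (assumed): mix tokens of all scales, then split them back."""
    sizes = [len(f) for f in levels]
    mixed = jtm_block([x for f in levels for x in f])
    out, start = [], 0
    for size in sizes:
        out.append(mixed[start:start + size])
        start += size
    return out


# Made-up two-scale features for an RGB / infrared image pair.
rgb_pyramid = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6]]
ir_pyramid = [[0.4, 0.3, 0.2, 0.1], [0.6, 0.5]]

fused_per_scale = [cross_modal_fusion(r, t) for r, t in zip(rgb_pyramid, ir_pyramid)]
fused_pyramid = cross_level_fusion(fused_per_scale)
print([len(f) for f in fused_pyramid])  # [4, 2]
```

The sketch only shows the ordering of operations: modalities are fused within each scale first, and the per-scale results are then fused across scales. The attention layer costs quadratic time in the number of tokens while the recurrence costs linear time, which is one plausible reading of why interleaving the two could help inference speed.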
About the journal:
The primary aim of Image and Vision Computing is to provide an effective medium of interchange for the results of high-quality theoretical and applied research fundamental to all aspects of image interpretation and computer vision. The journal publishes work that proposes new image interpretation and computer vision methodology or addresses the application of such methods to real-world scenes. It seeks to foster a deeper understanding of the discipline by encouraging quantitative comparison and performance evaluation of the proposed methodology. The coverage includes: image interpretation, scene modelling, object recognition and tracking, shape analysis, monitoring and surveillance, active vision and robotic systems, SLAM, biologically-inspired computer vision, motion analysis, stereo vision, document image understanding, character and handwritten text recognition, face and gesture recognition, biometrics, vision-based human-computer interaction, human activity and behavior understanding, data fusion from multiple sensor inputs, and image databases.