联合变压器和曼巴融合用于多光谱目标检测

IF 4.2 3区计算机科学 Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Image and Vision Computing Pub Date : 2025-02-27 DOI:10.1016/j.imavis.2025.105468

Chao Li, Xiaoming Peng

{"title":"联合变压器和曼巴融合用于多光谱目标检测","authors":"Chao Li, Xiaoming Peng","doi":"10.1016/j.imavis.2025.105468","DOIUrl":null,"url":null,"abstract":"<div><div>Multispectral object detection is generally considered better than single-modality-based object detection, due to the complementary properties of multispectral image pairs. However, how to integrate features from images of different modalities for object detection is still an open problem. In this paper, we propose a new multispectral object detection framework based on the Transformer and Mamba architectures, called the joint Transformer and Mamba detection (JTMDet). Specifically, we divide the feature fusion process into two stages, the intra-scale fusion stage and the inter-scale fusion stage, to comprehensively utilize the multi-modal features at different scales. To this end, we designed the so-called cross-modal fusion (CMF) and cross-level fusion (CLF) modules, both of which contain JTMBlock modules. A JTMBlock module interweaves the Transformer and Mamba layers to robustly capture the useful information in multispectral image pairs while maintaining high inference speed. Extensive experiments on three publicly available datasets conclusively show that the proposed JTMDet framework achieves state-of-the-art multispectral object detection performance, and is competitive with current leading methods. Code and pre-trained models are publicly available at <span><span>https://github.com/LiC2023/JTMDet</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"156 ","pages":"Article 105468"},"PeriodicalIF":4.2000,"publicationDate":"2025-02-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Joint Transformer and Mamba fusion for multispectral object detection\",\"authors\":\"Chao Li, Xiaoming Peng\",\"doi\":\"10.1016/j.imavis.2025.105468\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Multispectral object detection is generally considered better than single-modality-based object detection, due to the complementary properties of multispectral image pairs. However, how to integrate features from images of different modalities for object detection is still an open problem. In this paper, we propose a new multispectral object detection framework based on the Transformer and Mamba architectures, called the joint Transformer and Mamba detection (JTMDet). Specifically, we divide the feature fusion process into two stages, the intra-scale fusion stage and the inter-scale fusion stage, to comprehensively utilize the multi-modal features at different scales. To this end, we designed the so-called cross-modal fusion (CMF) and cross-level fusion (CLF) modules, both of which contain JTMBlock modules. A JTMBlock module interweaves the Transformer and Mamba layers to robustly capture the useful information in multispectral image pairs while maintaining high inference speed. Extensive experiments on three publicly available datasets conclusively show that the proposed JTMDet framework achieves state-of-the-art multispectral object detection performance, and is competitive with current leading methods. Code and pre-trained models are publicly available at <span><span>https://github.com/LiC2023/JTMDet</span><svg><path></path></svg></span>.</div></div>\",\"PeriodicalId\":50374,\"journal\":{\"name\":\"Image and Vision Computing\",\"volume\":\"156 \",\"pages\":\"Article 105468\"},\"PeriodicalIF\":4.2000,\"publicationDate\":\"2025-02-27\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Image and Vision Computing\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0262885625000563\",\"RegionNum\":3,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Image and Vision Computing","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0262885625000563","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}

引用次数: 0

摘要

由于多光谱图像对的互补特性，通常认为多光谱目标检测优于基于单模态的目标检测。然而，如何整合不同模态图像的特征进行目标检测仍然是一个有待解决的问题。在本文中，我们提出了一种新的基于Transformer和Mamba结构的多光谱目标检测框架，称为联合Transformer和Mamba检测（JTMDet）。具体而言，我们将特征融合过程分为尺度内融合阶段和尺度间融合阶段，以综合利用不同尺度上的多模态特征。为此，我们设计了所谓的跨模态融合（CMF）和跨层融合（CLF）模块，这两个模块都包含JTMBlock模块。JTMBlock模块将Transformer层和Mamba层交织在一起，在保持高推断速度的同时健壮地捕获多光谱图像对中的有用信息。在三个公开数据集上进行的大量实验最终表明，所提出的JTMDet框架实现了最先进的多光谱目标检测性能，并且与当前领先的方法具有竞争力。代码和预训练模型可在https://github.com/LiC2023/JTMDet上公开获取。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

Joint Transformer and Mamba fusion for multispectral object detection

查看原文本刊更多论文

Joint Transformer and Mamba fusion for multispectral object detection

Multispectral object detection is generally considered better than single-modality-based object detection, due to the complementary properties of multispectral image pairs. However, how to integrate features from images of different modalities for object detection is still an open problem. In this paper, we propose a new multispectral object detection framework based on the Transformer and Mamba architectures, called the joint Transformer and Mamba detection (JTMDet). Specifically, we divide the feature fusion process into two stages, the intra-scale fusion stage and the inter-scale fusion stage, to comprehensively utilize the multi-modal features at different scales. To this end, we designed the so-called cross-modal fusion (CMF) and cross-level fusion (CLF) modules, both of which contain JTMBlock modules. A JTMBlock module interweaves the Transformer and Mamba layers to robustly capture the useful information in multispectral image pairs while maintaining high inference speed. Extensive experiments on three publicly available datasets conclusively show that the proposed JTMDet framework achieves state-of-the-art multispectral object detection performance, and is competitive with current leading methods. Code and pre-trained models are publicly available at https://github.com/LiC2023/JTMDet.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Image and Vision Computing 工程技术-工程：电子与电气

CiteScore

8.50

自引率

8.50%

发文量

143

审稿时长

7.8 months

期刊介绍： Image and Vision Computing has as a primary aim the provision of an effective medium of interchange for the results of high quality theoretical and applied research fundamental to all aspects of image interpretation and computer vision. The journal publishes work that proposes new image interpretation and computer vision methodology or addresses the application of such methods to real world scenes. It seeks to strengthen a deeper understanding in the discipline by encouraging the quantitative comparison and performance evaluation of the proposed methodology. The coverage includes: image interpretation, scene modelling, object recognition and tracking, shape analysis, monitoring and surveillance, active vision and robotic systems, SLAM, biologically-inspired computer vision, motion analysis, stereo vision, document image understanding, character and handwritten text recognition, face and gesture recognition, biometrics, vision-based human-computer interaction, human activity and behavior understanding, data fusion from multiple sensor inputs, image databases.