{"title":"Low-Rank Multimodal Remote Sensing Object Detection With Frequency Filtering Experts","authors":"Xu Sun;Yinhui Yu;Qing Cheng","doi":"10.1109/TGRS.2024.3446814","DOIUrl":null,"url":null,"abstract":"Visible-infrared object detection for remote sensing images plays an important role in the unmanned aerial vehicle (UAV) around-the-clock application. Most of the existing work focuses on designing complex network architectures to fuse complementary features, while few methods consider computational complexity and susceptibility against modality attacks, limiting the deployment of state-of-the-art frameworks. In this article, we present a low-rank multimodal object detection approach with frequency filtering experts, called LF-MDet, which is based on the advanced DINO (detection transformer with improved denoising anchor boxes) framework. This approach achieves more accurate detection with fewer computational resources. In particular, when the specific modality is attacked or missing, our method still maintains higher robustness toward such pervasive perturbations. Specifically, we propose a low-rank enhancement technology (LET) and a dynamic illumination-aware mask (DIM) module to enable a single backbone network in the form of batch formulation to unbiasedly and compatibly extract multimodal features. Furthermore, we design a lightweight frequency expert encoder (FEE) from the frequency domain perspective to efficiently fuse complementary features by filtering out amplitude noise components and mixing feature tokens. Extensive experiments are conducted on the multimodal remote sensing object detection datasets, VEDAI and DroneVehicle. The results demonstrate the superiority of the proposed approach over advanced multimodal remote sensing object detectors. Compared to the baseline method, our low-rank multimodal detector (LF-MDet) effectively reduces the floating point of operations (FLOPs) by approximately 65% while improving detection accuracy. The code is available at \n<uri>https://github.com/cq100/LF-MDet</uri>\n.","PeriodicalId":13213,"journal":{"name":"IEEE Transactions on Geoscience and Remote Sensing","volume":"62 ","pages":"1-14"},"PeriodicalIF":7.5000,"publicationDate":"2024-08-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Geoscience and Remote Sensing","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/10643097/","RegionNum":1,"RegionCategory":"地球科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
Citations: 0
Abstract
Visible-infrared object detection in remote sensing images plays an important role in around-the-clock unmanned aerial vehicle (UAV) applications. Most existing work focuses on designing complex network architectures to fuse complementary features, while few methods consider computational complexity and susceptibility to modality attacks, which limits the deployment of state-of-the-art frameworks. In this article, we present a low-rank multimodal object detection approach with frequency filtering experts, called LF-MDet, built on the advanced DINO (detection transformer with improved denoising anchor boxes) framework. The approach achieves more accurate detection with fewer computational resources. In particular, when a specific modality is attacked or missing, our method maintains high robustness to such pervasive perturbations. Specifically, we propose a low-rank enhancement technology (LET) and a dynamic illumination-aware mask (DIM) module that enable a single backbone network, operating in a batch formulation, to extract multimodal features in an unbiased and compatible manner. Furthermore, we design a lightweight frequency expert encoder (FEE) that, from a frequency-domain perspective, efficiently fuses complementary features by filtering out amplitude noise components and mixing feature tokens. Extensive experiments are conducted on the multimodal remote sensing object detection datasets VEDAI and DroneVehicle. The results demonstrate the superiority of the proposed approach over advanced multimodal remote sensing object detectors. Compared to the baseline method, our low-rank multimodal detector (LF-MDet) reduces floating-point operations (FLOPs) by approximately 65% while improving detection accuracy. The code is available at https://github.com/cq100/LF-MDet.
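To make the two mechanisms named in the abstract more concrete, the following is a minimal, illustrative sketch (not the authors' implementation; see the linked repository for the actual LF-MDet code). It shows (1) the batch-formulation idea of running visible and infrared images through one shared backbone by stacking them along the batch dimension, and (2) a frequency-domain amplitude filter in the spirit of the FEE, which suppresses high-frequency amplitude components while retaining phase. All class names, shapes, and the low-pass mask choice here are assumptions for illustration only.

```python
# Hypothetical sketch of batched multimodal feature extraction and
# FFT-based amplitude filtering; not the official LF-MDet code.
import torch
import torch.nn as nn


class SharedBackboneBatchExtractor(nn.Module):
    """Extract features for both modalities with a single backbone by batching them."""

    def __init__(self, backbone: nn.Module):
        super().__init__()
        self.backbone = backbone

    def forward(self, rgb: torch.Tensor, ir: torch.Tensor):
        # rgb, ir: (B, 3, H, W) -> stacked to (2B, 3, H, W) for one forward pass
        x = torch.cat([rgb, ir], dim=0)
        feats = self.backbone(x)              # (2B, C, h, w)
        f_rgb, f_ir = feats.chunk(2, dim=0)   # split back into the two modalities
        return f_rgb, f_ir


def frequency_amplitude_filter(feat: torch.Tensor, keep_ratio: float = 0.5) -> torch.Tensor:
    """Illustrative amplitude filtering: keep low-frequency amplitudes, preserve phase."""
    # 2-D FFT over the spatial dimensions of a (B, C, H, W) feature map.
    spec = torch.fft.fft2(feat, norm="ortho")
    amp, phase = spec.abs(), spec.angle()

    # Hypothetical low-pass mask: keep the corners of the unshifted spectrum
    # (the low frequencies) and zero out the rest, treated here as amplitude noise.
    _, _, H, W = feat.shape
    h, w = max(1, int(H * keep_ratio) // 2), max(1, int(W * keep_ratio) // 2)
    mask = torch.zeros(H, W, device=feat.device)
    mask[:h, :w] = mask[:h, -w:] = mask[-h:, :w] = mask[-h:, -w:] = 1.0

    filtered = (amp * mask) * torch.exp(1j * phase)
    return torch.fft.ifft2(filtered, norm="ortho").real
```

In this sketch, the filtered RGB and IR features would then be mixed by the encoder (e.g., via token mixing) to form the fused representation; the paper's actual FEE design and the LET/DIM modules are described in the article and the released code.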
Journal Introduction:
IEEE Transactions on Geoscience and Remote Sensing (TGRS) is a monthly publication that focuses on the theory, concepts, and techniques of science and engineering as applied to sensing the land, oceans, atmosphere, and space; and the processing, interpretation, and dissemination of this information.