Lingyun Tian; Qiang Shen; Zilong Deng; Yang Gao; Simiao Wang
{"title":"基于掩模制导的可见光-红外车辆检测交叉模态融合网络","authors":"Lingyun Tian;Qiang Shen;Zilong Deng;Yang Gao;Simiao Wang","doi":"10.1109/LSP.2025.3562816","DOIUrl":null,"url":null,"abstract":"Drone-based vehicle detection is crucial for intelligent traffic management. However, current methods relying solely on single visible or infrared modalities struggle with precision and robustness, especially in adverse weather conditions. The effective integration of cross-modal information to enhance vehicle detection still poses significant challenges. In this letter, we propose a masked-guided cross-modality fusion method, called MCMF, for robust and accurate visible-infrared vehicle detection. Firstly, we construct a framework consisting of three branches, with two dedicated to the visible and infrared modalities respectively, and another tailored for the fused multi-modal. Secondly, we introduce a Location-Sensitive Masked AutoEncoder (LMAE) for intermediate-level feature fusion. Specifically, our LMAE utilizes masks to cover intermediate-level features of one modality based on the prediction hierarchy of another modality, and then distills cross-modality guidance information through regularization constraints. This strategy, through a self-learning paradigm, effectively preserves the useful information from both modalities while eliminating redundant information from each. Finally, the fused features are input into an uncertainty-based detection head to generate predictions for bounding boxes of vehicles. When evaluated on the DroneVehicle dataset, our MCIF reaches 71.42% w.r..t. mAP, outperforming an established baseline method by 7.42%. Ablation studies further demonstrate the effectiveness of our LMAE for visible-infrared fusion.","PeriodicalId":13154,"journal":{"name":"IEEE Signal Processing Letters","volume":"32 ","pages":"1815-1819"},"PeriodicalIF":3.2000,"publicationDate":"2025-04-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Mask-Guided Cross-Modality Fusion Network for Visible-Infrared Vehicle Detection\",\"authors\":\"Lingyun Tian;Qiang Shen;Zilong Deng;Yang Gao;Simiao Wang\",\"doi\":\"10.1109/LSP.2025.3562816\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Drone-based vehicle detection is crucial for intelligent traffic management. However, current methods relying solely on single visible or infrared modalities struggle with precision and robustness, especially in adverse weather conditions. The effective integration of cross-modal information to enhance vehicle detection still poses significant challenges. In this letter, we propose a masked-guided cross-modality fusion method, called MCMF, for robust and accurate visible-infrared vehicle detection. Firstly, we construct a framework consisting of three branches, with two dedicated to the visible and infrared modalities respectively, and another tailored for the fused multi-modal. Secondly, we introduce a Location-Sensitive Masked AutoEncoder (LMAE) for intermediate-level feature fusion. Specifically, our LMAE utilizes masks to cover intermediate-level features of one modality based on the prediction hierarchy of another modality, and then distills cross-modality guidance information through regularization constraints. This strategy, through a self-learning paradigm, effectively preserves the useful information from both modalities while eliminating redundant information from each. 
Finally, the fused features are input into an uncertainty-based detection head to generate predictions for bounding boxes of vehicles. When evaluated on the DroneVehicle dataset, our MCIF reaches 71.42% w.r..t. mAP, outperforming an established baseline method by 7.42%. Ablation studies further demonstrate the effectiveness of our LMAE for visible-infrared fusion.\",\"PeriodicalId\":13154,\"journal\":{\"name\":\"IEEE Signal Processing Letters\",\"volume\":\"32 \",\"pages\":\"1815-1819\"},\"PeriodicalIF\":3.2000,\"publicationDate\":\"2025-04-21\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Signal Processing Letters\",\"FirstCategoryId\":\"5\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10971225/\",\"RegionNum\":2,\"RegionCategory\":\"工程技术\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"ENGINEERING, ELECTRICAL & ELECTRONIC\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Signal Processing Letters","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/10971225/","RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
Mask-Guided Cross-Modality Fusion Network for Visible-Infrared Vehicle Detection
Drone-based vehicle detection is crucial for intelligent traffic management. However, current methods relying solely on a single visible or infrared modality struggle with precision and robustness, especially in adverse weather conditions. The effective integration of cross-modal information to enhance vehicle detection still poses significant challenges. In this letter, we propose a mask-guided cross-modality fusion method, called MCMF, for robust and accurate visible-infrared vehicle detection. Firstly, we construct a framework consisting of three branches: two dedicated to the visible and infrared modalities, respectively, and a third tailored to the fused multi-modal features. Secondly, we introduce a Location-Sensitive Masked AutoEncoder (LMAE) for intermediate-level feature fusion. Specifically, our LMAE utilizes masks to cover intermediate-level features of one modality based on the prediction hierarchy of the other modality, and then distills cross-modality guidance information through regularization constraints. Through a self-learning paradigm, this strategy effectively preserves the useful information from both modalities while eliminating the redundant information in each. Finally, the fused features are input into an uncertainty-based detection head to generate bounding-box predictions for vehicles. When evaluated on the DroneVehicle dataset, our MCMF reaches 71.42% mAP, outperforming an established baseline method by 7.42%. Ablation studies further demonstrate the effectiveness of our LMAE for visible-infrared fusion.
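The abstract's central mechanism, using one modality's prediction confidence to decide which of the other modality's intermediate features to mask, reconstruct, and regularize, can be illustrated with a short sketch. The PyTorch code below is a minimal illustration under stated assumptions only: the names (`conf_to_mask`, `MaskGuidedFusion`), the top-k masking rule, the light convolutional decoder, and the MSE regularization term are all hypothetical, since the paper's actual LMAE architecture, masking rule, and loss formulation are not given in the abstract.

```python
# Hypothetical sketch of a mask-guided cross-modality fusion step (not the
# paper's actual LMAE implementation). Modality B's per-location confidence
# selects which of modality A's intermediate features are kept; the masked
# features are reconstructed by a light decoder, and an L2 reconstruction
# term stands in for the "regularization constraint" in the abstract.
import torch
import torch.nn as nn
import torch.nn.functional as F


def conf_to_mask(conf_map: torch.Tensor, keep_ratio: float = 0.5) -> torch.Tensor:
    """Binary mask keeping the top `keep_ratio` most confident locations.

    conf_map: (B, 1, H, W) confidence from one modality's prediction head,
    used to decide which features of the *other* modality to keep (1) or
    mask out (0). The top-k rule is an assumption for illustration.
    """
    b, _, h, w = conf_map.shape
    flat = conf_map.flatten(1)                       # (B, H*W)
    k = max(1, int(keep_ratio * flat.shape[1]))
    thresh = flat.topk(k, dim=1).values[:, -1:]      # per-sample k-th largest value
    return (flat >= thresh).float().view(b, 1, h, w)


class MaskGuidedFusion(nn.Module):
    """Masks modality A's features using modality B's confidence, reconstructs
    them, and fuses the result with modality B's features."""

    def __init__(self, channels: int):
        super().__init__()
        self.decoder = nn.Sequential(                # stand-in for an MAE-style decoder
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
        self.fuse = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, feat_a, feat_b, conf_b):
        mask = conf_to_mask(conf_b)                  # guidance from modality B
        recon = self.decoder(feat_a * mask)          # reconstruct masked A features
        recon_loss = F.mse_loss(recon, feat_a.detach())  # cross-modal regularization
        fused = self.fuse(torch.cat([recon, feat_b], dim=1))
        return fused, recon_loss
```

In a full three-branch framework, `fused` would feed the detection branch (the paper's uncertainty-based head is not sketched here), and `recon_loss` would be added to the detection loss so that the self-learning reconstruction objective shapes the shared features during training.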
Journal Introduction:
The IEEE Signal Processing Letters is a monthly, archival publication designed to provide rapid dissemination of original, cutting-edge ideas and timely, significant contributions in signal, image, speech, language and audio processing. Papers published in the Letters can be presented within one year of their appearance in signal processing conferences such as ICASSP, GlobalSIP and ICIP, and also in several workshops organized by the Signal Processing Society.