{"title":"Deformable Cross-Attention Transformer for Weakly Aligned RGB–T Pedestrian Detection","authors":"Yu Hu;Xiaobo Chen;Sheng Wang;Luyang Liu;Hengyang Shi;Lihong Fan;Jing Tian;Jun Liang","doi":"10.1109/TMM.2025.3543056","DOIUrl":null,"url":null,"abstract":"Pedestrian detection plays a crucial role in autonomous driving systems. To ensure reliable and effective detection in challenging conditions, researchers have proposed RGB–T (RGB–thermal) detectors that integrate thermal images with color images for more complementary feature representations. However, existing methods face challenges in capturing the spatial and geometric correlations between different modalities, as well as in assuming perfect synchronization of the two modalities, which is unrealistic in real-world scenarios. In response to these challenges, we present a new deformable-attention-based approach for weakly aligned RGB–T pedestrian detection. The proposed method uses a dual-branch cross-attention mechanism to capture the inherent spatial and geometric correlations between color and thermal images. Furthermore, it incorporates positional information for each image pixel into the sampling offset generation to enhance robustness in scenarios where modalities are not precisely aligned or registered. To reduce computational complexity, we introduce a local attention mechanism that samples only a small set of keys and values within a limited region in the feature maps for each query. Extensive experiments and ablation studies conducted on multiple public datasets confirm the effectiveness of the proposed framework.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"4400-4411"},"PeriodicalIF":9.7000,"publicationDate":"2025-02-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Multimedia","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10891492/","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
引用次数: 0
Abstract
Pedestrian detection plays a crucial role in autonomous driving systems. To ensure reliable and effective detection in challenging conditions, researchers have proposed RGB–T (RGB–thermal) detectors that integrate thermal images with color images for more complementary feature representations. However, existing methods face challenges in capturing the spatial and geometric correlations between different modalities, as well as in assuming perfect synchronization of the two modalities, which is unrealistic in real-world scenarios. In response to these challenges, we present a new deformable-attention-based approach for weakly aligned RGB–T pedestrian detection. The proposed method uses a dual-branch cross-attention mechanism to capture the inherent spatial and geometric correlations between color and thermal images. Furthermore, it incorporates positional information for each image pixel into the sampling offset generation to enhance robustness in scenarios where modalities are not precisely aligned or registered. To reduce computational complexity, we introduce a local attention mechanism that samples only a small set of keys and values within a limited region in the feature maps for each query. Extensive experiments and ablation studies conducted on multiple public datasets confirm the effectiveness of the proposed framework.
期刊介绍:
The IEEE Transactions on Multimedia delves into diverse aspects of multimedia technology and applications, covering circuits, networking, signal processing, systems, software, and systems integration. The scope aligns with the Fields of Interest of the sponsors, ensuring a comprehensive exploration of research in multimedia.