{"title":"基于跨模态融合冲突消除的三维单目标跟踪","authors":"Yushi Yang;Wei Li;Ying Yao;Bo Zhou;Baojie Fan","doi":"10.1109/LRA.2025.3551951","DOIUrl":null,"url":null,"abstract":"3D single object tracking based on point clouds is a key challenge in robotics and autonomous driving technology. Mainstream methods rely on point clouds for geometric matching or motion estimation between the target template and the search area. However, the lack of texture and the sparsity of incomplete point clouds make it difficult for unimodal trackers to distinguish objects with similar structures. To overcome the limitations of previous methods, this letter proposes a cross-modal fusion conflict elimination tracker (CCETrack). The point clouds collected by LiDAR provide accurate depth and shape information about the surrounding environment, while the camera sensor provides RGB images containing rich semantic and texture information. CCETrack fully leverages both modalities to track 3D objects. Specifically, to address cross-modal conflicts caused by heterogeneous sensors, we propose a global context alignment module that aligns RGB images with point clouds and generates enhanced image features. Then, a sparse feature enhancement module is designed to optimize voxelized point cloud features using the rich image features. In the feature fusion stage, both modalities are converted into BEV features, with the template and search area features fused separately. A self-attention mechanism is employed to establish bidirectional communication between regions. Our method maximizes the use of effective information and achieves state-of-the-art performance on the KITTI and nuScenes datasets through multimodal complementarity.","PeriodicalId":13241,"journal":{"name":"IEEE Robotics and Automation Letters","volume":"10 5","pages":"4826-4833"},"PeriodicalIF":4.6000,"publicationDate":"2025-03-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"3D Single Object Tracking With Cross-Modal Fusion Conflict Elimination\",\"authors\":\"Yushi Yang;Wei Li;Ying Yao;Bo Zhou;Baojie Fan\",\"doi\":\"10.1109/LRA.2025.3551951\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"3D single object tracking based on point clouds is a key challenge in robotics and autonomous driving technology. Mainstream methods rely on point clouds for geometric matching or motion estimation between the target template and the search area. However, the lack of texture and the sparsity of incomplete point clouds make it difficult for unimodal trackers to distinguish objects with similar structures. To overcome the limitations of previous methods, this letter proposes a cross-modal fusion conflict elimination tracker (CCETrack). The point clouds collected by LiDAR provide accurate depth and shape information about the surrounding environment, while the camera sensor provides RGB images containing rich semantic and texture information. CCETrack fully leverages both modalities to track 3D objects. Specifically, to address cross-modal conflicts caused by heterogeneous sensors, we propose a global context alignment module that aligns RGB images with point clouds and generates enhanced image features. Then, a sparse feature enhancement module is designed to optimize voxelized point cloud features using the rich image features. In the feature fusion stage, both modalities are converted into BEV features, with the template and search area features fused separately. 
A self-attention mechanism is employed to establish bidirectional communication between regions. Our method maximizes the use of effective information and achieves state-of-the-art performance on the KITTI and nuScenes datasets through multimodal complementarity.\",\"PeriodicalId\":13241,\"journal\":{\"name\":\"IEEE Robotics and Automation Letters\",\"volume\":\"10 5\",\"pages\":\"4826-4833\"},\"PeriodicalIF\":4.6000,\"publicationDate\":\"2025-03-17\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Robotics and Automation Letters\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10930556/\",\"RegionNum\":2,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"ROBOTICS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Robotics and Automation Letters","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10930556/","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"ROBOTICS","Score":null,"Total":0}
3D single object tracking based on point clouds is a key challenge in robotics and autonomous driving technology. Mainstream methods rely on point clouds for geometric matching or motion estimation between the target template and the search area. However, the lack of texture and the sparsity of incomplete point clouds make it difficult for unimodal trackers to distinguish objects with similar structures. To overcome the limitations of previous methods, this letter proposes a cross-modal fusion conflict elimination tracker (CCETrack). The point clouds collected by LiDAR provide accurate depth and shape information about the surrounding environment, while the camera sensor provides RGB images containing rich semantic and texture information. CCETrack fully leverages both modalities to track 3D objects. Specifically, to address cross-modal conflicts caused by heterogeneous sensors, we propose a global context alignment module that aligns RGB images with point clouds and generates enhanced image features. Then, a sparse feature enhancement module is designed to optimize voxelized point cloud features using the rich image features. In the feature fusion stage, both modalities are converted into BEV features, with the template and search area features fused separately. A self-attention mechanism is employed to establish bidirectional communication between regions. Our method maximizes the use of effective information and achieves state-of-the-art performance on the KITTI and nuScenes datasets through multimodal complementarity.
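The abstract describes the fusion stage only at a high level. As a purely illustrative aid (not the authors' implementation), the PyTorch sketch below shows one way the described BEV-level fusion with bidirectional self-attention between template and search regions could be realized; the module name, channel widths, and the concatenation-plus-1x1-convolution per-region fusion are assumptions introduced here for the example.

```python
# Minimal sketch of BEV-level cross-modal fusion with bidirectional
# template <-> search self-attention, as described in the abstract.
# All design details below (names, sizes, 1x1-conv fusion) are assumptions.
import torch
import torch.nn as nn


class BEVCrossRegionFusion(nn.Module):
    def __init__(self, img_channels: int, pc_channels: int,
                 embed_dim: int = 128, num_heads: int = 4):
        super().__init__()
        # Project each modality's BEV map to a common embedding width (assumed design).
        self.img_proj = nn.Conv2d(img_channels, embed_dim, kernel_size=1)
        self.pc_proj = nn.Conv2d(pc_channels, embed_dim, kernel_size=1)
        # Per-region fusion of the two modalities by concatenation + 1x1 conv (assumed).
        self.fuse = nn.Conv2d(2 * embed_dim, embed_dim, kernel_size=1)
        # Self-attention over the union of template and search tokens gives
        # bidirectional communication between the two regions.
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(embed_dim)

    def _to_tokens(self, img_bev: torch.Tensor, pc_bev: torch.Tensor) -> torch.Tensor:
        # (B, C, H, W) BEV maps -> (B, H*W, embed_dim) token sequence.
        fused = self.fuse(torch.cat([self.img_proj(img_bev), self.pc_proj(pc_bev)], dim=1))
        return fused.flatten(2).transpose(1, 2)

    def forward(self, tpl_img, tpl_pc, srch_img, srch_pc):
        tpl_tokens = self._to_tokens(tpl_img, tpl_pc)       # template region
        srch_tokens = self._to_tokens(srch_img, srch_pc)    # search region
        tokens = torch.cat([tpl_tokens, srch_tokens], dim=1)
        attended, _ = self.attn(tokens, tokens, tokens)      # bidirectional exchange
        tokens = self.norm(tokens + attended)
        # Split the attended tokens back into per-region sequences.
        n_tpl = tpl_tokens.shape[1]
        return tokens[:, :n_tpl], tokens[:, n_tpl:]


if __name__ == "__main__":
    fusion = BEVCrossRegionFusion(img_channels=64, pc_channels=64)
    tpl_img = torch.randn(2, 64, 16, 16)    # template BEV image features (toy sizes)
    tpl_pc = torch.randn(2, 64, 16, 16)     # template BEV point-cloud features
    srch_img = torch.randn(2, 64, 32, 32)   # search-area BEV image features
    srch_pc = torch.randn(2, 64, 32, 32)    # search-area BEV point-cloud features
    tpl_out, srch_out = fusion(tpl_img, tpl_pc, srch_img, srch_pc)
    print(tpl_out.shape, srch_out.shape)    # (2, 256, 128) and (2, 1024, 128)
```

Splitting the attended tokens back into template and search sequences keeps the two regions separable for a downstream localization head, while the shared attention pass lets each region condition on the other, which corresponds to the "bidirectional communication" the abstract mentions.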
About the journal:
The scope of this journal is to publish peer-reviewed articles that provide a timely and concise account of innovative research ideas and application results, reporting significant theoretical findings and application case studies in areas of robotics and automation.