SyNet: A Synergistic Network for 3D Object Detection Through Geometric-Semantic-Based Multi-Interaction Fusion

IF 9.7 | Zone 1 (Computer Science) | JCR Q1, COMPUTER SCIENCE, INFORMATION SYSTEMS

IEEE Transactions on Multimedia Pub Date : 2025-02-18 DOI:10.1109/TMM.2025.3542993

Xiaoqin Zhang;Kenan Bi;Sixian Chan;Shijian Lu;Xiaolong Zhou

Volume 27, pages 4950-4960. Journal Article. Full text: https://ieeexplore.ieee.org/document/10891647/

Citations: 0

Abstract

Driven by rising demands in autonomous driving, robotics, and related fields, 3D object detection has recently advanced considerably by fusing optical images with LiDAR point clouds. However, most existing optical-LiDAR fusion methods directly overlay RGB images and point clouds without adequately exploiting the synergy between them, leading to suboptimal fusion and 3D detection performance. They also often suffer from limited localization accuracy because they do not properly balance global and local object information. To address these issues, we design a synergistic network (SyNet) that fuses geometric information, semantic information, and global and local object information for robust and accurate 3D detection. SyNet captures synergies between optical images and LiDAR point clouds from three perspectives. The first is geometric: it derives high-quality depth by projecting point clouds onto multi-view images, enriching optical RGB images with 3D spatial information for a more accurate interpretation of image semantics. The second is semantic: it voxelizes point clouds and establishes correspondences between the resulting voxels and image pixels, enriching 3D point clouds with semantic information for more accurate 3D detection. The third balances local and global object information: it introduces deformable self-attention and cross-attention to process the two types of complementary information in parallel for more accurate object localization. Extensive experiments show that SyNet achieves 70.7% mAP and 73.5% NDS on the nuScenes test set, demonstrating its effectiveness and its advantage over state-of-the-art methods.
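The first two perspectives in the abstract both rest on projecting 3D data into the image plane. The following is a minimal NumPy sketch of that idea, not the authors' implementation: it assumes a single pinhole camera whose intrinsic matrix has last row [0, 0, 1], a known 4x4 LiDAR-to-camera transform, and a voxel grid anchored at the LiDAR origin. All function and parameter names (`project_to_pixels`, `sparse_depth_map`, `voxel_pixel_pairs`, `voxel_size`) are hypothetical and chosen for illustration only.

```python
import numpy as np


def project_to_pixels(points, lidar_to_cam, intrinsics):
    """Map (N, 3) LiDAR-frame points to integer pixel coordinates and depth.

    Assumes a pinhole camera: `intrinsics` is 3x3 with last row [0, 0, 1],
    and `lidar_to_cam` is a 4x4 rigid transform (both are assumptions).
    """
    homo = np.hstack([points, np.ones((points.shape[0], 1))])
    cam = (lidar_to_cam @ homo.T).T[:, :3]
    z = cam[:, 2]
    front = z > 1e-6                     # points in front of the camera
    safe_z = np.where(front, z, 1.0)     # avoid dividing by zero or negatives
    uvw = (intrinsics @ cam.T).T
    u = np.floor(uvw[:, 0] / safe_z).astype(int)
    v = np.floor(uvw[:, 1] / safe_z).astype(int)
    return u, v, z, front


def sparse_depth_map(points, lidar_to_cam, intrinsics, image_hw):
    """Sparse depth image: nearest LiDAR depth per pixel, 0 where no point lands."""
    h, w = image_hw
    u, v, z, front = project_to_pixels(points, lidar_to_cam, intrinsics)
    keep = front & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    depth = np.full((h, w), np.inf)
    # Unbuffered minimum so the closest point wins when several share a pixel.
    np.minimum.at(depth, (v[keep], u[keep]), z[keep])
    depth[np.isinf(depth)] = 0.0
    return depth


def voxel_pixel_pairs(points, voxel_size, lidar_to_cam, intrinsics, image_hw):
    """Pair each occupied voxel with the pixel that its center projects to."""
    h, w = image_hw
    voxel_idx = np.unique(np.floor(points / voxel_size).astype(int), axis=0)
    centers = (voxel_idx + 0.5) * voxel_size
    u, v, _, front = project_to_pixels(centers, lidar_to_cam, intrinsics)
    keep = front & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    return voxel_idx[keep], np.stack([u[keep], v[keep]], axis=1)
```

In a multi-view setup such as nuScenes, a routine like this would be run once per camera with that camera's calibration. The sparse depth map could then be concatenated with RGB features, and the voxel-pixel pairs used to gather image features into the voxel grid; how SyNet actually performs these steps is not specified in the abstract.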


Source journal

IEEE Transactions on Multimedia (Engineering & Technology: Telecommunications)

CiteScore

11.70

Self-citation rate

11.00%

Articles published

576

Review time

5.5 months

About the journal: The IEEE Transactions on Multimedia delves into diverse aspects of multimedia technology and applications, covering circuits, networking, signal processing, systems, software, and systems integration. The scope aligns with the Fields of Interest of the sponsors, ensuring a comprehensive exploration of research in multimedia.