Image attention transformer network for indoor 3D object detection

IF 4.9 | Zone 2, Engineering Technology | JCR Q1, ENGINEERING, MULTIDISCIPLINARY

Science China Technological Sciences | Pub Date: 2024-06-26 | DOI: 10.1007/s11431-023-2552-x

KeYan Ren, Tong Yan, ZhaoXin Hu, HongGui Han, YunLu Zhang

Citations: 0

Abstract

Point clouds and RGB images are both critical inputs for 3D object detection. Recent multi-modal methods combine them directly and perform remarkably well, but they ignore the distinct forms of the two data types. To mitigate the effect of this intrinsic difference on performance, we propose a novel and effective fusion model, the LI-Attention model, which considers both RGB features and point cloud features and assigns a weight to each RGB feature through an attention mechanism. Building on the LI-Attention model, we propose a 3D object detection method called the image attention transformer network (IAT-Net), specialized for indoor RGB-D scenes. Compared with previous work on multi-modal detection, IAT-Net fuses elaborate RGB features from 2D detection results with point cloud features through the attention mechanism, and generates and refines 3D detection results with a transformer model. Extensive experiments demonstrate that our approach outperforms state-of-the-art methods on two widely used benchmarks for indoor 3D object detection, SUN RGB-D and NYU Depth V2, and ablation studies analyze the effect of each module. The source code for the proposed IAT-Net is publicly available at https://github.com/wisper181/IAT-Net.
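
The abstract does not give the exact formulation of the LI-Attention model, so the following is only a minimal sketch of the general idea of assigning an attention weight to each RGB feature. It assumes scaled dot-product attention with the point cloud feature as the query and the RGB features as keys and values, and it assumes both feature types share the same dimension. The function name `attention_fuse` and the toy vectors are hypothetical and are not taken from the IAT-Net repository.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def attention_fuse(point_feat, rgb_feats):
    """Fuse one point cloud feature with several RGB features.

    Assumption for illustration only: the point feature is the query,
    each RGB feature is both key and value, and the fused output is the
    point feature concatenated with the attention-weighted RGB average.
    """
    scale = math.sqrt(len(point_feat))
    # One similarity score per RGB feature.
    scores = [dot(point_feat, r) / scale for r in rgb_feats]
    # Softmax turns the scores into weights that sum to 1.
    weights = softmax(scores)
    dim = len(rgb_feats[0])
    # Weighted average of the RGB features.
    pooled = [sum(w * r[i] for w, r in zip(weights, rgb_feats))
              for i in range(dim)]
    # List concatenation: [point feature | pooled RGB feature].
    return point_feat + pooled, weights


# Toy data: one 2-D point feature and three 2-D RGB features.
point = [1.0, 0.0]
rgb = [[2.0, 0.0], [0.0, 2.0], [-2.0, 0.0]]
fused, weights = attention_fuse(point, rgb)
```

In this toy example the RGB feature most aligned with the point feature receives the largest weight, which is the behaviour the abstract attributes to per-feature weighting. The actual model may use learned projections and a different fusion step.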

Source journal

Science China Technological Sciences (ENGINEERING, MULTIDISCIPLINARY; MATERIALS SCIENCE, MULTIDISCIPLINARY)

CiteScore: 8.40

Self-citation rate: 10.90%

Articles published: 4380

Review time: 3.3 months

About the journal: Science China Technological Sciences is an academic journal cosponsored by the Chinese Academy of Sciences and the National Natural Science Foundation of China and published by Science China Press. It is committed to publishing high-quality, original results in both basic and applied research. The journal is published in both print and electronic forms and is indexed by the Science Citation Index. Categories of articles: Reviews summarize representative results and achievements in a particular topic or area, comment on the current state of research, and advise on research directions; the author's own opinion and related discussion are requested. Research papers report important original results in all areas of technological sciences. Brief reports present the latest important results in a timely manner.