MuTrans: Multiple Transformers for Fusing Feature Pyramid on 2D and 3D Object Detection

Bangquan Xie; Liang Yang; Ailin Wei; Xiaoxiong Weng; Bing Li

IEEE Transactions on Image Processing, vol. 32, pp. 4407-4415, published 2023-08-01. DOI: 10.1109/TIP.2023.3299190. https://ieeexplore.ieee.org/document/10198476/
Impact Factor: 13.7 · Citations: 2

Abstract

One of the major components of a neural network, the feature pyramid plays a vital part in perception tasks such as object detection for autonomous driving. However, fusing multi-level and multi-sensor feature pyramids for object detection remains a challenge. This paper proposes a simple yet effective framework named MuTrans (Multiple Transformers) that fuses the feature pyramid in a single-stream 2D detector or a two-stream 3D detector. Built on an encoder-decoder design, MuTrans focuses on the most significant features via multiple Transformers. The MuTrans encoder uses three novel self-attention mechanisms: Spatial-wise BoxAlign attention (SB) for low-level spatial locations, Context-wise Affinity attention (CA) for high-level context information, and high-level attention for multi-level features. The MuTrans decoder then processes these significant proposals, including the RoI and the context affinity. In addition, Low- and High-level Fusion (LHF) in the encoder reduces the number of computational parameters, and Pre-LN is used to accelerate training convergence; LHF and Pre-LN are shown to mitigate self-attention's computational complexity and slow training convergence, respectively. Our results demonstrate that MuTrans achieves higher detection accuracy than the baseline methods, particularly for small objects: a 2.1-point gain on the $AP_{S}$ metric on MS-COCO 2017 with a ResNeXt-101 backbone, a 2.18-point gain in 3D detection accuracy (moderate difficulty) for the small pedestrian class on KITTI, and a 6.85-point gain in the RC metric (Town05 Long) on the CARLA urban driving simulator.
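The abstract credits Pre-LN (pre-layer normalization, i.e., applying layer normalization before the self-attention and feed-forward sub-layers rather than after) with faster training convergence. The PyTorch sketch below is a minimal, hypothetical illustration of such a Pre-LN encoder block, not the authors' implementation; the class name, dimensions, and hyperparameters are assumptions chosen only for illustration.

```python
import torch
import torch.nn as nn


class PreLNEncoderBlock(nn.Module):
    """Minimal Pre-LN Transformer encoder block (illustrative sketch only).

    Layer normalization is applied *before* the self-attention and
    feed-forward sub-layers; this Pre-LN ordering is the arrangement the
    abstract associates with faster training convergence.
    """

    def __init__(self, embed_dim: int = 256, num_heads: int = 8, ffn_dim: int = 1024):
        super().__init__()
        self.norm1 = nn.LayerNorm(embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.ffn = nn.Sequential(
            nn.Linear(embed_dim, ffn_dim),
            nn.ReLU(inplace=True),
            nn.Linear(ffn_dim, embed_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Self-attention sub-layer: pre-normalize, attend, add residual.
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out
        # Feed-forward sub-layer, also pre-normalized, with residual.
        x = x + self.ffn(self.norm2(x))
        return x


if __name__ == "__main__":
    # Toy usage: one feature-pyramid level flattened into a token sequence
    # (batch=2, tokens=100, channels=256) -- shapes are illustrative only.
    tokens = torch.randn(2, 100, 256)
    block = PreLNEncoderBlock()
    print(block(tokens).shape)  # torch.Size([2, 100, 256])
```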