Bangquan Xie;Liang Yang;Ailin Wei;Xiaoxiong Weng;Bing Li
MuTrans: Multiple Transformers for Fusing Feature Pyramid on 2D and 3D Object Detection
IEEE Transactions on Image Processing, vol. 32, pp. 4407-4415, Aug. 2023
DOI: 10.1109/TIP.2023.3299190
https://ieeexplore.ieee.org/document/10198476/
Citations: 2
Abstract
The feature pyramid, one of the major components of modern neural networks, plays a vital part in perception tasks such as object detection for autonomous driving. However, fusing multi-level and multi-sensor feature pyramids for object detection remains a challenge. This paper proposes a simple yet effective framework named MuTrans (Multiple Transformers) to fuse feature pyramids in a single-stream 2D detector or a two-stream 3D detector. Built on an encoder-decoder design, MuTrans focuses on the significant features via multiple Transformers. The MuTrans encoder uses three innovative self-attention mechanisms: Spatial-wise BoxAlign attention (SB) for low-level spatial locations, Context-wise Affinity attention (CA) for high-level context information, and high-level attention for multi-level features. The MuTrans decoder then processes these significant proposals, including the RoI and the context affinity. In addition, the Low- and High-level Fusion (LHF) in the encoder reduces the number of computational parameters, and Pre-LN is utilized to accelerate training convergence; LHF and Pre-LN are shown to mitigate self-attention's computational complexity and slow training convergence, respectively. Our results demonstrate higher detection accuracy for MuTrans than for the baseline method, particularly on small objects: a 2.1-point gain on the $AP_{S}$ index on MS-COCO 2017 with a ResNeXt-101 backbone, a 2.18-point gain in 3D detection accuracy (moderate difficulty) for the small pedestrian class on KITTI, and a 6.85-point gain on the RC index (Town05 Long) on the CARLA urban driving simulator.
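The Pre-LN arrangement mentioned in the abstract is a standard Transformer variant in which layer normalization is applied *before* each sub-layer rather than after it, which is widely reported to stabilize and speed up training. As a generic illustration only (a minimal single-head sketch in NumPy, not the authors' implementation, with hypothetical weight names `Wq`, `Wk`, `Wv`):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token vector to zero mean, unit variance.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x, Wq, Wk, Wv):
    # Scaled dot-product self-attention, single head.
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    return scores @ v

def pre_ln_block(x, Wq, Wk, Wv):
    # Pre-LN: normalize BEFORE the attention sub-layer, then add the
    # residual. (Post-LN would instead compute
    # layer_norm(x + self_attention(x, ...)).)
    return x + self_attention(layer_norm(x), Wq, Wk, Wv)

rng = np.random.default_rng(0)
d = 8                                   # токen embedding dimension
x = rng.standard_normal((4, d))         # 4 tokens
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
y = pre_ln_block(x, Wq, Wk, Wv)
print(y.shape)  # (4, 8): same shape as the input, as a residual block requires
```

Because normalization sits inside the residual branch, the skip path carries the raw signal unchanged, which is the usual explanation for the faster convergence the paper attributes to Pre-LN.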