Xuan Fan, Tao An, Hongbo Gao, Tao Xie, Lijun Zhao, Ruifeng Li
Advanced Intelligent Systems (Weinheim an der Bergstrasse, Germany), volume 7, issue 8. Published 2025-07-07. DOI: 10.1002/aisy.202401001. Impact factor: 6.1. JCR: Q1 (Automation & Control Systems). https://advanced.onlinelibrary.wiley.com/doi/10.1002/aisy.202401001
DBF-Net: A Deep Bidirectional Fusion Network for 6D Object Pose Estimation with Sparse Linear Transformer
6D object pose estimation, a core task in computer vision and robotics, determines the 3D location and orientation of an object relative to a canonical reference frame. The widespread availability of RGB-D sensors has recently increased interest in 6D pose estimation from RGB-D data. This work presents DBF-Net, a deep bidirectional fusion network for efficient yet accurate 6D object pose estimation. Specifically, a sparse linear Transformer (SLT) with linear computational complexity is introduced to exploit cross-modal semantic similarity during the feature extraction stage, so that it fully models semantic associations between the modalities and efficiently aggregates the globally enhanced features of each modality. Once feature representations have been obtained from the two modalities, a feature balancer (FB) based on SLT adaptively reconciles their relative importance. Leveraging the global receptive field of SLT, FB effectively eliminates the ambiguity caused by visually similar appearance in the appearance representation or by missing depth on reflective surfaces in the geometry representation, thereby improving the generalization ability and robustness of the network. Experimental results demonstrate that DBF-Net surpasses current state-of-the-art methods by nontrivial margins across multiple benchmarks. The code is available at https://github.com/Mrfanxuan/dbf_net.
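To make the "linear computational complexity" claim concrete, the following is a minimal sketch of generic kernelized linear attention applied across two modalities. It is not the authors' SLT: the abstract does not specify the SLT design, so the feature map, the function names, the token counts, and the absence of any sparsity mechanism are all assumptions made for illustration. The sketch only shows why summarizing the keys and values first avoids building the full token-by-token attention matrix.

```python
import numpy as np


def feature_map(x):
    """Positive feature map phi(x) = elu(x) + 1 (an assumed, commonly used choice)."""
    return np.where(x > 0, x + 1.0, np.exp(x))


def linear_cross_attention(queries, keys, values):
    """Kernelized attention whose cost grows linearly with the number of tokens.

    queries: (N_q, d) features from one modality.
    keys, values: (N_k, d) and (N_k, d_v) features from the other modality.
    """
    q = feature_map(queries)
    k = feature_map(keys)
    # Summaries over all keys; their size is independent of N_q and N_k.
    kv = k.T @ values          # (d, d_v)
    k_sum = k.sum(axis=0)      # (d,)
    normalizer = q @ k_sum     # (N_q,), strictly positive because phi > 0
    return (q @ kv) / normalizer[:, None]


def quadratic_cross_attention(queries, keys, values):
    """Reference version that builds the explicit (N_q, N_k) attention matrix."""
    weights = feature_map(queries) @ feature_map(keys).T
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ values


rng = np.random.default_rng(0)
rgb_feats = rng.standard_normal((6, 4))    # hypothetical appearance tokens
point_feats = rng.standard_normal((5, 4))  # hypothetical geometry tokens

# Bidirectional use: each modality queries the other one.
rgb_enhanced = linear_cross_attention(rgb_feats, point_feats, point_feats)
point_enhanced = linear_cross_attention(point_feats, rgb_feats, rgb_feats)
```

Both functions compute the same output; the linear version reorders the matrix products so that no intermediate result has size N_q by N_k, which is what keeps the cost linear in the number of tokens.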