Object Pose Estimation Based on Multi-precision Vectors and Seg-Driven PnP

IF 11.6 2区计算机科学 Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

International Journal of Computer Vision Pub Date : 2024-12-07 DOI:10.1007/s11263-024-02317-y

Yulin Wang, Hongli Li, Chen Luo

{"title":"Object Pose Estimation Based on Multi-precision Vectors and Seg-Driven PnP","authors":"Yulin Wang, Hongli Li, Chen Luo","doi":"10.1007/s11263-024-02317-y","DOIUrl":null,"url":null,"abstract":"<p>Object pose estimation based on a single RGB image has wide application potential but is difficult to achieve. Existing pose estimation involves various inference pipelines. One popular pipeline is to first use Convolutional Neural Networks (CNN) to predict 2D projections of 3D keypoints in a single RGB image and then calculate the 6D pose via a Perspective-n-Point (PnP) solver. Due to the gap between synthetic data and real data, the model trained on synthetic data has difficulty predicting the 6D pose accurately when applied to real data. To address the acute problem, we propose a two-stage pipeline of object pose estimation based upon multi-precision vectors and segmentation-driven (Seg-Driven) PnP. In keypoint localization stage, we first develop a CNN-based three-branch network to predict multi-precision 2D vectors pointing to 2D keypoints. Then we introduce an accurate and fast Keypoint Voting scheme of Multi-precision vectors (KVM), which computes low-precision 2D keypoints using low-precision vectors and refines 2D keypoints on mid- and high-precision vectors. In the pose calculation stage, we propose Seg-Driven PnP to refine the 3D Translation of poses and get the optimal pose by minimizing the non-overlapping area between segmented and rendered masks. The Seg-Driven PnP leverages 2D segmentation trained on real images to improve the accuracy of pose estimation trained on synthetic data, thereby reducing the synthetic-to-real gap. Extensive experiments show our approach materially outperforms state-of-the-art methods on LM and HB datasets. Importantly, our proposed method works reasonably well for weakly textured and occluded objects in diverse scenes.</p>","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"6 1","pages":""},"PeriodicalIF":11.6000,"publicationDate":"2024-12-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"International Journal of Computer Vision","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1007/s11263-024-02317-y","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}

引用次数: 0

Abstract

Object pose estimation based on a single RGB image has wide application potential but is difficult to achieve. Existing pose estimation involves various inference pipelines. One popular pipeline is to first use Convolutional Neural Networks (CNN) to predict 2D projections of 3D keypoints in a single RGB image and then calculate the 6D pose via a Perspective-n-Point (PnP) solver. Due to the gap between synthetic data and real data, the model trained on synthetic data has difficulty predicting the 6D pose accurately when applied to real data. To address the acute problem, we propose a two-stage pipeline of object pose estimation based upon multi-precision vectors and segmentation-driven (Seg-Driven) PnP. In keypoint localization stage, we first develop a CNN-based three-branch network to predict multi-precision 2D vectors pointing to 2D keypoints. Then we introduce an accurate and fast Keypoint Voting scheme of Multi-precision vectors (KVM), which computes low-precision 2D keypoints using low-precision vectors and refines 2D keypoints on mid- and high-precision vectors. In the pose calculation stage, we propose Seg-Driven PnP to refine the 3D Translation of poses and get the optimal pose by minimizing the non-overlapping area between segmented and rendered masks. The Seg-Driven PnP leverages 2D segmentation trained on real images to improve the accuracy of pose estimation trained on synthetic data, thereby reducing the synthetic-to-real gap. Extensive experiments show our approach materially outperforms state-of-the-art methods on LM and HB datasets. Importantly, our proposed method works reasonably well for weakly textured and occluded objects in diverse scenes.

查看原文本刊更多论文

基于多精度矢量和分段驱动PnP的目标姿态估计

基于单幅RGB图像的目标姿态估计具有广泛的应用潜力，但实现难度较大。现有的姿态估计涉及各种推理管道。一种流行的方法是首先使用卷积神经网络（CNN）来预测单个RGB图像中3D关键点的2D投影，然后通过Perspective-n-Point （PnP）求解器计算6D姿态。由于合成数据与真实数据之间的差距，使用合成数据训练的模型在应用于真实数据时难以准确预测6D位姿。为了解决这个尖锐的问题，我们提出了一种基于多精度向量和分割驱动（segdriven） PnP的两阶段目标姿态估计管道。在关键点定位阶段，我们首先开发了一种基于cnn的三分支网络来预测指向二维关键点的多精度二维向量。在此基础上，提出了一种精确、快速的多精度矢量关键点投票方案，该方案利用低精度矢量计算低精度二维关键点，并在中高精度矢量上细化二维关键点。在姿态计算阶段，我们提出了分段驱动的PnP算法来细化姿态的三维平移，并通过最小化分割和渲染蒙版之间的非重叠区域来获得最优姿态。Seg-Driven PnP利用在真实图像上训练的2D分割来提高在合成数据上训练的姿态估计的准确性，从而减少合成与真实的差距。大量的实验表明，我们的方法在LM和HB数据集上明显优于最先进的方法。重要的是，我们提出的方法对于不同场景中的弱纹理和遮挡物体都能很好地工作。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

International Journal of Computer Vision 工程技术-计算机：人工智能

CiteScore

29.80

自引率

2.10%

发文量

163

审稿时长

6 months

期刊介绍： The International Journal of Computer Vision (IJCV) serves as a platform for sharing new research findings in the rapidly growing field of computer vision. It publishes 12 issues annually and presents high-quality, original contributions to the science and engineering of computer vision. The journal encompasses various types of articles to cater to different research outputs. Regular articles, which span up to 25 journal pages, focus on significant technical advancements that are of broad interest to the field. These articles showcase substantial progress in computer vision. Short articles, limited to 10 pages, offer a swift publication path for novel research outcomes. They provide a quicker means for sharing new findings with the computer vision community. Survey articles, comprising up to 30 pages, offer critical evaluations of the current state of the art in computer vision or offer tutorial presentations of relevant topics. These articles provide comprehensive and insightful overviews of specific subject areas. In addition to technical articles, the journal also includes book reviews, position papers, and editorials by prominent scientific figures. These contributions serve to complement the technical content and provide valuable perspectives. The journal encourages authors to include supplementary material online, such as images, video sequences, data sets, and software. This additional material enhances the understanding and reproducibility of the published research. Overall, the International Journal of Computer Vision is a comprehensive publication that caters to researchers in this rapidly growing field. It covers a range of article types, offers additional online resources, and facilitates the dissemination of impactful research.