FaVoR: Features via Voxel Rendering for Camera Relocalization
Vincenzo Polizzi, Marco Cannici, Davide Scaramuzza, Jonathan Kelly
arXiv - CS - Robotics, 2024-09-11. https://doi.org/arxiv-2409.07571
Abstract
Camera relocalization methods range from dense image alignment to direct camera pose regression from a query image. Among these, sparse feature matching stands out as an efficient, versatile, and generally lightweight approach with numerous applications. However, feature-based methods often struggle with significant viewpoint and appearance changes, leading to matching failures and inaccurate pose estimates. To overcome this limitation, we propose a novel approach that leverages a globally sparse yet locally dense 3D representation of 2D features. By tracking and triangulating landmarks over a sequence of frames, we construct a sparse voxel map optimized to render the image patch descriptors observed during tracking. Given an initial pose estimate, we first synthesize descriptors from the voxels using volumetric rendering and then perform feature matching to estimate the camera pose. This methodology enables the generation of descriptors for unseen views, enhancing robustness to view changes. We extensively evaluate our method on the 7-Scenes and Cambridge Landmarks datasets. Our results show that it significantly outperforms existing state-of-the-art feature representation techniques in indoor environments, achieving up to a 39% improvement in median translation error. Additionally, our approach yields results comparable to other methods in outdoor scenarios while maintaining lower memory and computational costs.
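To make the pipeline concrete, below is a minimal sketch (in Python with NumPy) of the central operation the abstract describes: alpha-compositing per-voxel feature descriptors along a camera ray, in the style of volume rendering. This is not the authors' implementation; the grid layout, the function name render_descriptor, and all shapes are illustrative assumptions, and nearest-neighbour lookup stands in for the trilinear interpolation a real system would use.

```python
# Minimal sketch of volumetric descriptor rendering (NOT the authors' code).
# Assumption: each tracked landmark owns a small voxel grid over [-1, 1]^3
# storing a density and a D-dimensional descriptor per cell; all names and
# shapes here are illustrative.
import numpy as np

def render_descriptor(ray_o, ray_d, grid_density, grid_desc,
                      t_near, t_far, n_samples=32):
    """Alpha-composite descriptors sampled along one ray through a voxel grid.

    ray_o, ray_d : (3,) ray origin and unit direction in the grid's frame
    grid_density : (R, R, R) non-negative densities
    grid_desc    : (R, R, R, D) per-cell feature descriptors
    Returns a (D,) descriptor rendered for this viewpoint.
    """
    R = grid_density.shape[0]
    t = np.linspace(t_near, t_far, n_samples)
    pts = ray_o[None, :] + t[:, None] * ray_d[None, :]        # (n_samples, 3)

    # Nearest-neighbour cell lookup; a real system would interpolate trilinearly.
    idx = np.clip(((pts + 1.0) * 0.5 * (R - 1)).round().astype(int), 0, R - 1)
    sigma = grid_density[idx[:, 0], idx[:, 1], idx[:, 2]]     # (n_samples,)
    desc = grid_desc[idx[:, 0], idx[:, 1], idx[:, 2]]         # (n_samples, D)

    # Standard volume-rendering weights: alpha_i = 1 - exp(-sigma_i * dt),
    # weight_i = T_i * alpha_i with transmittance T_i = prod_{j<i}(1 - alpha_j).
    dt = (t_far - t_near) / n_samples
    alpha = 1.0 - np.exp(-sigma * dt)
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))
    weights = trans * alpha
    return (weights[:, None] * desc).sum(axis=0)

# Toy usage: an 8^3 grid with 64-dim descriptors, viewed along +z.
rng = np.random.default_rng(0)
grid_d = rng.uniform(0.0, 5.0, size=(8, 8, 8))
grid_f = rng.normal(size=(8, 8, 8, 64))
d = render_descriptor(np.array([0.0, 0.0, -1.5]), np.array([0.0, 0.0, 1.0]),
                      grid_d, grid_f, t_near=0.5, t_far=2.5)
print(d.shape)  # (64,)
```

In the full pipeline the abstract outlines, descriptors rendered this way for a candidate pose would be matched against descriptors extracted from the query image, and the resulting 2D-3D correspondences passed to a PnP solver (e.g., OpenCV's solvePnPRansac) to estimate the camera pose.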