Vincenzo Polizzi, Marco Cannici, Davide Scaramuzza, Jonathan Kelly
arXiv - CS - Robotics, published 2024-09-11. DOI: arxiv-2409.07571 (https://doi.org/arxiv-2409.07571)
FaVoR: Features via Voxel Rendering for Camera Relocalization
Camera relocalization methods range from dense image alignment to direct
camera pose regression from a query image. Among these, sparse feature matching
stands out as an efficient, versatile, and generally lightweight approach with
numerous applications. However, feature-based methods often struggle with
significant viewpoint and appearance changes, leading to matching failures and
inaccurate pose estimates. To overcome this limitation, we propose a novel
approach that leverages a globally sparse yet locally dense 3D representation
of 2D features. By tracking and triangulating landmarks over a sequence of
frames, we construct a sparse voxel map optimized to render image patch
descriptors observed during tracking. Given an initial pose estimate, we first
synthesize descriptors from the voxels using volumetric rendering and then
perform feature matching to estimate the camera pose. This methodology enables
the generation of descriptors for unseen views, enhancing robustness to view
changes. We extensively evaluate our method on the 7-Scenes and Cambridge
Landmarks datasets. Our results show that our method significantly outperforms
existing state-of-the-art feature representation techniques in indoor
environments, achieving up to a 39% improvement in median translation error.
Additionally, our approach yields comparable results to other methods for
outdoor scenarios while maintaining lower memory and computational costs.
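
The render-then-match step described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the standard volumetric rendering weights (opacity from density and step length, multiplied by accumulated transmittance) and a simple mutual-nearest-neighbour matcher. The function names `render_descriptor` and `mutual_nearest_matches`, and the per-sample `densities`, `features`, and `deltas` arrays, are hypothetical and stand in for values that would be interpolated from a landmark's voxel grid along a camera ray.

```python
import numpy as np


def render_descriptor(densities, features, deltas):
    """Alpha-composite per-sample features along one ray into a descriptor.

    densities: (N,) non-negative density at each sample along the ray
    features:  (N, D) feature vector at each sample
    deltas:    (N,) distance between consecutive samples
    """
    densities = np.asarray(densities, dtype=float)
    features = np.asarray(features, dtype=float)
    deltas = np.asarray(deltas, dtype=float)
    # Opacity of each sample from its density and the step length.
    alphas = 1.0 - np.exp(-densities * deltas)
    # Transmittance: fraction of the ray reaching each sample unoccluded.
    transmittance = np.concatenate(([1.0], np.cumprod(1.0 - alphas)[:-1]))
    weights = transmittance * alphas
    descriptor = weights @ features
    # Unit-normalise so dot products act as cosine similarity.
    norm = np.linalg.norm(descriptor)
    return descriptor / norm if norm > 0 else descriptor


def mutual_nearest_matches(rendered, query):
    """Return (i, j) pairs where rendered[i] and query[j] pick each other."""
    similarity = rendered @ query.T  # rows are assumed unit-norm
    best_query = similarity.argmax(axis=1)
    best_rendered = similarity.argmax(axis=0)
    return [(i, int(j)) for i, j in enumerate(best_query)
            if best_rendered[j] == i]


# One ray: empty space, then a dense sample that occludes the one behind it.
descriptor = render_descriptor(
    densities=[0.0, 50.0, 50.0],
    features=[[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]],
    deltas=[0.1, 0.1, 0.1],
)

# Two rendered landmark descriptors matched against two query descriptors.
rendered = np.array([[1.0, 0.0], [0.0, 1.0]])
query = np.array([[0.0, 1.0], [1.0, 0.0]])
matches = mutual_nearest_matches(rendered, query)
```

In a full pipeline, the resulting 2D-3D correspondences would then go to a pose solver (for example PnP inside a RANSAC loop) to estimate the camera pose. That step is omitted here.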