PlaneRecTR++：用于关节三维平面重建和姿态估计的统一查询学习。

IF 18.6 1区计算机科学 Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

IEEE Transactions on Pattern Analysis and Machine Intelligence Pub Date : 2025-09-16 DOI:10.1109/tpami.2025.3610500

Jingjia Shi,Shuaifeng Zhi,Kai Xu

{"title":"PlaneRecTR++：用于关节三维平面重建和姿态估计的统一查询学习。","authors":"Jingjia Shi,Shuaifeng Zhi,Kai Xu","doi":"10.1109/tpami.2025.3610500","DOIUrl":null,"url":null,"abstract":"3D plane reconstruction from images can usually be divided into several sub-tasks of plane detection, segmentation, parameters regression and possibly depth prediction for per-frame, along with plane correspondence and relative camera pose estimation between frames. Previous works tend to divide and conquer these sub-tasks with distinct network modules, overall formulated by a two-stage paradigm. With an initial camera pose and per-frame plane predictions provided from the first stage, exclusively designed modules, potentially relying on extra plane correspondence labelling, are applied to merge multi-view plane entities and produce 6DoF camera pose. As none of existing works manage to integrate above closely related sub-tasks into a unified framework but treat them separately and sequentially, we suspect it potentially as a main source of performance limitation for existing approaches. Motivated by this finding and the success of query-based learning in enriching reasoning among semantic entities, in this paper, we propose PlaneRecTR++, a Transformer-based architecture, which for the first time unifies all sub-tasks related to multi-view reconstruction and pose estimation with a compact single-stage model, refraining from initial pose estimation and plane correspondence supervision. Extensive quantitative and qualitative experiments demonstrate that our proposed unified learning achieves mutual benefits across sub-tasks, obtaining a new state-of-the-art performance on public ScanNetv1, ScanNetv2, NYUv2-Plane, and MatterPort3D datasets.","PeriodicalId":13426,"journal":{"name":"IEEE Transactions on Pattern Analysis and Machine Intelligence","volume":"9 1","pages":""},"PeriodicalIF":18.6000,"publicationDate":"2025-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"PlaneRecTR++: Unified Query Learning for Joint 3D Planar Reconstruction and Pose Estimation.\",\"authors\":\"Jingjia Shi,Shuaifeng Zhi,Kai Xu\",\"doi\":\"10.1109/tpami.2025.3610500\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"3D plane reconstruction from images can usually be divided into several sub-tasks of plane detection, segmentation, parameters regression and possibly depth prediction for per-frame, along with plane correspondence and relative camera pose estimation between frames. Previous works tend to divide and conquer these sub-tasks with distinct network modules, overall formulated by a two-stage paradigm. With an initial camera pose and per-frame plane predictions provided from the first stage, exclusively designed modules, potentially relying on extra plane correspondence labelling, are applied to merge multi-view plane entities and produce 6DoF camera pose. As none of existing works manage to integrate above closely related sub-tasks into a unified framework but treat them separately and sequentially, we suspect it potentially as a main source of performance limitation for existing approaches. Motivated by this finding and the success of query-based learning in enriching reasoning among semantic entities, in this paper, we propose PlaneRecTR++, a Transformer-based architecture, which for the first time unifies all sub-tasks related to multi-view reconstruction and pose estimation with a compact single-stage model, refraining from initial pose estimation and plane correspondence supervision. Extensive quantitative and qualitative experiments demonstrate that our proposed unified learning achieves mutual benefits across sub-tasks, obtaining a new state-of-the-art performance on public ScanNetv1, ScanNetv2, NYUv2-Plane, and MatterPort3D datasets.\",\"PeriodicalId\":13426,\"journal\":{\"name\":\"IEEE Transactions on Pattern Analysis and Machine Intelligence\",\"volume\":\"9 1\",\"pages\":\"\"},\"PeriodicalIF\":18.6000,\"publicationDate\":\"2025-09-16\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Pattern Analysis and Machine Intelligence\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://doi.org/10.1109/tpami.2025.3610500\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Pattern Analysis and Machine Intelligence","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1109/tpami.2025.3610500","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}

引用次数: 0

摘要

从图像中重建三维平面通常可以分为平面检测、分割、参数回归和可能的每帧深度预测等几个子任务，以及帧之间的平面对应和相对相机姿态估计。以前的作品倾向于用不同的网络模块来划分和征服这些子任务，总体上采用两阶段范式。通过第一阶段提供的初始相机姿态和每帧平面预测，专门设计的模块（可能依赖于额外的平面对应标记）用于合并多视图平面实体并产生6DoF相机姿态。由于现有的工作都没有设法将上述密切相关的子任务集成到一个统一的框架中，而是分别和顺序地对待它们，我们怀疑这可能是现有方法性能限制的主要来源。基于这一发现以及基于查询的学习在丰富语义实体之间推理方面的成功，本文提出了PlaneRecTR++，这是一种基于transformer的架构，首次将与多视图重构和姿态估计相关的所有子任务统一为一个紧凑的单阶段模型，避免了初始姿态估计和面对应监督。大量的定量和定性实验表明，我们提出的统一学习在子任务之间实现了互利，在公共ScanNetv1， ScanNetv2， NYUv2-Plane和MatterPort3D数据集上获得了新的最先进的性能。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

PlaneRecTR++: Unified Query Learning for Joint 3D Planar Reconstruction and Pose Estimation.

3D plane reconstruction from images can usually be divided into several sub-tasks of plane detection, segmentation, parameters regression and possibly depth prediction for per-frame, along with plane correspondence and relative camera pose estimation between frames. Previous works tend to divide and conquer these sub-tasks with distinct network modules, overall formulated by a two-stage paradigm. With an initial camera pose and per-frame plane predictions provided from the first stage, exclusively designed modules, potentially relying on extra plane correspondence labelling, are applied to merge multi-view plane entities and produce 6DoF camera pose. As none of existing works manage to integrate above closely related sub-tasks into a unified framework but treat them separately and sequentially, we suspect it potentially as a main source of performance limitation for existing approaches. Motivated by this finding and the success of query-based learning in enriching reasoning among semantic entities, in this paper, we propose PlaneRecTR++, a Transformer-based architecture, which for the first time unifies all sub-tasks related to multi-view reconstruction and pose estimation with a compact single-stage model, refraining from initial pose estimation and plane correspondence supervision. Extensive quantitative and qualitative experiments demonstrate that our proposed unified learning achieves mutual benefits across sub-tasks, obtaining a new state-of-the-art performance on public ScanNetv1, ScanNetv2, NYUv2-Plane, and MatterPort3D datasets.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

IEEE Transactions on Pattern Analysis and Machine Intelligence 工程技术-工程：电子与电气

CiteScore

28.40

自引率

3.00%

发文量

885

审稿时长

8.5 months

期刊介绍： The IEEE Transactions on Pattern Analysis and Machine Intelligence publishes articles on all traditional areas of computer vision and image understanding, all traditional areas of pattern analysis and recognition, and selected areas of machine intelligence, with a particular emphasis on machine learning for pattern analysis. Areas such as techniques for visual search, document and handwriting analysis, medical image analysis, video and image sequence analysis, content-based retrieval of image and video, face and gesture recognition and relevant specialized hardware and/or software architectures are also covered.