通过以对象为中心的体素化和神经渲染来理解动态场景

IEEE transactions on pattern analysis and machine intelligence Pub Date : 2025-02-06 DOI:10.1109/TPAMI.2025.3539866

Yanpeng Zhao;Yiwei Hao;Siyu Gao;Yunbo Wang;Xiaokang Yang

{"title":"通过以对象为中心的体素化和神经渲染来理解动态场景","authors":"Yanpeng Zhao;Yiwei Hao;Siyu Gao;Yunbo Wang;Xiaokang Yang","doi":"10.1109/TPAMI.2025.3539866","DOIUrl":null,"url":null,"abstract":"Learning object-centric representations from unsupervised videos is challenging. Unlike most previous approaches that focus on decomposing 2D images, we present a 3D generative model named DynaVol-S for dynamic scenes that enables object-centric learning within a differentiable volume rendering framework. The key idea is to perform object-centric voxelization to capture the 3D nature of the scene, which infers per-object occupancy probabilities at individual spatial locations. These voxel features evolve through a canonical-space deformation function and are optimized in an inverse rendering pipeline with a compositional NeRF. Additionally, our approach integrates 2D semantic features to create 3D semantic grids, representing the scene through multiple disentangled voxel grids. DynaVol-S significantly outperforms existing models in both novel view synthesis and unsupervised decomposition tasks for dynamic scenes. By jointly considering geometric structures and semantic features, it effectively addresses challenging real-world scenarios involving complex object interactions. Furthermore, once trained, the explicitly meaningful voxel features enable additional capabilities that 2D scene decomposition methods cannot achieve, such as novel scene generation through editing geometric shapes or manipulating the motion trajectories of objects.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 5","pages":"4215-4231"},"PeriodicalIF":0.0000,"publicationDate":"2025-02-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10877772","citationCount":"0","resultStr":"{\"title\":\"Dynamic Scene Understanding Through Object-Centric Voxelization and Neural Rendering\",\"authors\":\"Yanpeng Zhao;Yiwei Hao;Siyu Gao;Yunbo Wang;Xiaokang Yang\",\"doi\":\"10.1109/TPAMI.2025.3539866\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Learning object-centric representations from unsupervised videos is challenging. Unlike most previous approaches that focus on decomposing 2D images, we present a 3D generative model named DynaVol-S for dynamic scenes that enables object-centric learning within a differentiable volume rendering framework. The key idea is to perform object-centric voxelization to capture the 3D nature of the scene, which infers per-object occupancy probabilities at individual spatial locations. These voxel features evolve through a canonical-space deformation function and are optimized in an inverse rendering pipeline with a compositional NeRF. Additionally, our approach integrates 2D semantic features to create 3D semantic grids, representing the scene through multiple disentangled voxel grids. DynaVol-S significantly outperforms existing models in both novel view synthesis and unsupervised decomposition tasks for dynamic scenes. By jointly considering geometric structures and semantic features, it effectively addresses challenging real-world scenarios involving complex object interactions. Furthermore, once trained, the explicitly meaningful voxel features enable additional capabilities that 2D scene decomposition methods cannot achieve, such as novel scene generation through editing geometric shapes or manipulating the motion trajectories of objects.\",\"PeriodicalId\":94034,\"journal\":{\"name\":\"IEEE transactions on pattern analysis and machine intelligence\",\"volume\":\"47 5\",\"pages\":\"4215-4231\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2025-02-06\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10877772\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE transactions on pattern analysis and machine intelligence\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10877772/\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on pattern analysis and machine intelligence","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/10877772/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 0

摘要

从无监督视频中学习以对象为中心的表示是具有挑战性的。与之前大多数专注于分解2D图像的方法不同，我们为动态场景提出了一个名为DynaVol-S的3D生成模型，该模型可以在可微分体渲染框架内实现以对象为中心的学习。关键思想是执行以物体为中心的体素化来捕捉场景的3D性质，从而推断每个物体在单个空间位置的占用概率。这些体素特征通过标准空间变形函数进化，并在具有合成NeRF的反向渲染管道中进行优化。此外，我们的方法集成了2D语义特征来创建3D语义网格，通过多个解纠缠的体素网格来表示场景。在动态场景的新视图合成和无监督分解任务中，DynaVol-S显著优于现有模型。通过联合考虑几何结构和语义特征，它有效地解决了涉及复杂对象交互的具有挑战性的现实世界场景。此外，一旦训练，明确有意义的体素特征可以实现2D场景分解方法无法实现的额外功能，例如通过编辑几何形状或操纵物体的运动轨迹来生成新的场景。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Dynamic Scene Understanding Through Object-Centric Voxelization and Neural Rendering

Learning object-centric representations from unsupervised videos is challenging. Unlike most previous approaches that focus on decomposing 2D images, we present a 3D generative model named DynaVol-S for dynamic scenes that enables object-centric learning within a differentiable volume rendering framework. The key idea is to perform object-centric voxelization to capture the 3D nature of the scene, which infers per-object occupancy probabilities at individual spatial locations. These voxel features evolve through a canonical-space deformation function and are optimized in an inverse rendering pipeline with a compositional NeRF. Additionally, our approach integrates 2D semantic features to create 3D semantic grids, representing the scene through multiple disentangled voxel grids. DynaVol-S significantly outperforms existing models in both novel view synthesis and unsupervised decomposition tasks for dynamic scenes. By jointly considering geometric structures and semantic features, it effectively addresses challenging real-world scenarios involving complex object interactions. Furthermore, once trained, the explicitly meaningful voxel features enable additional capabilities that 2D scene decomposition methods cannot achieve, such as novel scene generation through editing geometric shapes or manipulating the motion trajectories of objects.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

IEEE transactions on pattern analysis and machine intelligence

自引率

0.00%

发文量