{"title":"Look at the Sky: Sky-aware Efficient 3D Gaussian Splatting in the Wild.","authors":"Yuze Wang, Junyi Wang, Ruicheng Gao, Yansong Qu, Wantong Duan, Shuo Yang, Yue Qi","doi":"10.1109/TVCG.2025.3549187","DOIUrl":null,"url":null,"abstract":"<p><p>Photos taken in unconstrained tourist environments often present challenges for accurate 3D scene reconstruction due to variable appearances and transient occlusions, which can introduce artifacts in novel view synthesis. Recently, in-the-wild 3D scene reconstruction has been achieved realistic rendering with Neural Radiance Fields (NeRFs). With the advancement of 3D Gaussian Splatting (3DGS), some methods also attempt to reconstruct 3D scenes from unconstrained photo collections and achieve real-time rendering. However, the rapid convergence of 3DGS is misaligned with the slower convergence of neural network-based appearance encoder and transient mask predictor, hindering the reconstruction efficiency. To address this, we propose a novel sky-aware framework for scene reconstruction from unconstrained photo collection using 3DGS. Firstly, we observe that the learnable per-image transient mask predictor in previous work is unnecessary. By introducing a simple yet efficient greedy supervision strategy, we directly utilize the pseudo mask generated by a pre-trained semantic segmentation network as the transient mask, thereby achieving more efficient and higher quality in-the-wild 3D scene reconstruction. Secondly, we find that separately estimating appearance embeddings for the sky and building significantly improves reconstruction efficiency and accuracy. We analyze the underlying reasons and introduce a neural sky module to generate diverse skies from latent sky embeddings extract from unconstrained images. Finally, we propose a mutual distillation learning strategy to constrain sky and building appearance embeddings within the same latent space, further enhancing reconstruction efficiency and quality. Extensive experiments on multiple datasets demonstrate that the proposed framework outperforms existing methods in novel view and appearance synthesis, offering superior rendering quality with faster convergence and rendering speed.</p>","PeriodicalId":94035,"journal":{"name":"IEEE transactions on visualization and computer graphics","volume":"PP ","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2025-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on visualization and computer graphics","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/TVCG.2025.3549187","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Photos taken in unconstrained tourist environments often present challenges for accurate 3D scene reconstruction due to variable appearances and transient occlusions, which can introduce artifacts in novel view synthesis. Recently, in-the-wild 3D scene reconstruction has achieved realistic rendering with Neural Radiance Fields (NeRFs). With the advancement of 3D Gaussian Splatting (3DGS), some methods also attempt to reconstruct 3D scenes from unconstrained photo collections and achieve real-time rendering. However, the rapid convergence of 3DGS is misaligned with the slower convergence of the neural-network-based appearance encoder and transient mask predictor, hindering reconstruction efficiency. To address this, we propose a novel sky-aware framework for scene reconstruction from unconstrained photo collections using 3DGS. First, we observe that the learnable per-image transient mask predictor used in previous work is unnecessary. By introducing a simple yet efficient greedy supervision strategy, we directly use the pseudo mask generated by a pre-trained semantic segmentation network as the transient mask, achieving more efficient and higher-quality in-the-wild 3D scene reconstruction. Second, we find that separately estimating appearance embeddings for the sky and the buildings significantly improves reconstruction efficiency and accuracy. We analyze the underlying reasons and introduce a neural sky module that generates diverse skies from latent sky embeddings extracted from unconstrained images. Finally, we propose a mutual distillation learning strategy that constrains the sky and building appearance embeddings to the same latent space, further enhancing reconstruction efficiency and quality. Extensive experiments on multiple datasets demonstrate that the proposed framework outperforms existing methods in novel view and appearance synthesis, offering superior rendering quality with faster convergence and rendering speed.
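
The abstract describes replacing a learnable per-image transient mask predictor with a pseudo mask from a pre-trained semantic segmentation network. The snippet below is a minimal, hypothetical sketch of that general idea, not the authors' implementation: the segmentation output is thresholded into transient and sky masks, and the transient mask gates a photometric loss so occluded pixels do not supervise the 3DGS render. Class ids, tensor layout, and function names are assumptions for illustration.

```python
# Hypothetical sketch: gating a 3DGS photometric loss with a pseudo transient
# mask derived from a pre-trained semantic segmentation network.
# Class ids below are placeholders, not the paper's actual label set.
import torch

TRANSIENT_CLASSES = {11, 12, 13}  # e.g. person, rider, car (placeholder ids)
SKY_CLASS = 10                    # placeholder id for the sky class

def pseudo_masks(seg_logits: torch.Tensor):
    """Derive binary transient and sky masks from segmentation logits (C, H, W)."""
    labels = seg_logits.argmax(dim=0)                    # (H, W) per-pixel class id
    transient = torch.zeros_like(labels, dtype=torch.bool)
    for c in TRANSIENT_CLASSES:
        transient |= labels == c
    sky = labels == SKY_CLASS
    return transient, sky

def masked_photometric_loss(rendered: torch.Tensor,
                            target: torch.Tensor,
                            transient: torch.Tensor) -> torch.Tensor:
    """L1 loss over non-transient pixels; rendered/target are (3, H, W) images."""
    keep = (~transient).float().unsqueeze(0)             # (1, H, W) validity weights
    diff = (rendered - target).abs() * keep               # zero out transient pixels
    return diff.sum() / (keep.sum().clamp(min=1.0) * rendered.shape[0])
```

Because the mask comes from a frozen, pre-trained network rather than a jointly trained predictor, it is available from the first iteration and does not have to converge alongside the fast-converging Gaussians, which is the efficiency argument the abstract makes.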