StereoCrafter: Diffusion-based Generation of Long and High-fidelity Stereoscopic 3D from Monocular Videos

Sijie Zhao, Wenbo Hu, Xiaodong Cun, Yong Zhang, Xiaoyu Li, Zhe Kong, Xiangjun Gao, Muyao Niu, Ying Shan
arXiv - CS - Graphics, 2024-09-11. DOI: arxiv-2409.07447 (https://doi.org/arxiv-2409.07447)

Abstract

This paper presents a novel framework for converting 2D videos to immersive stereoscopic 3D, addressing the growing demand for 3D content in immersive experiences. Leveraging foundation models as priors, our approach overcomes the limitations of traditional methods and improves performance to deliver the high-fidelity generation that display devices require. The proposed system consists of two main steps: depth-based video splatting, which warps the input video and extracts an occlusion mask, and stereo video inpainting. We use pre-trained Stable Video Diffusion as the backbone and introduce a fine-tuning protocol for the stereo video inpainting task. To handle input videos of varying lengths and resolutions, we explore auto-regressive strategies and tiled processing. Finally, we develop a sophisticated data processing pipeline to reconstruct a large-scale, high-quality dataset to support our training. Our framework demonstrates significant improvements in 2D-to-3D video conversion, offering a practical solution for creating immersive content for 3D devices such as Apple Vision Pro and 3D displays. In summary, this work contributes an effective method for generating high-quality stereoscopic videos from monocular input, potentially transforming how we experience digital media.
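The first step described above, depth-based warping with occlusion-mask extraction, can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the splatting scheme, so this uses a simple nearest-pixel forward warp with a z-buffer. The function name `forward_warp`, the per-pixel `disparity` input, and the convention that the second view is formed by shifting pixels to the left are all assumptions made for illustration.

```python
import numpy as np

def forward_warp(left, disparity):
    """Warp one frame to a second viewpoint by shifting pixels horizontally.

    left:      (H, W, C) array, the source frame.
    disparity: (H, W) non-negative array in pixels (larger = closer).
    Returns (right, hole_mask); hole_mask is True where no source pixel
    landed, i.e. the region an inpainting model would have to fill.
    """
    h, w = disparity.shape
    right = np.zeros_like(left)
    # z-buffer: the largest disparity written so far at each target pixel.
    zbuf = np.full((h, w), -np.inf)
    for y in range(h):
        for x in range(w):
            d = disparity[y, x]
            xt = int(round(x - d))  # assumed convention: shift left by d
            # Keep the nearest surface when several pixels collide.
            if 0 <= xt < w and d > zbuf[y, xt]:
                zbuf[y, xt] = d
                right[y, xt] = left[y, x]
    hole_mask = np.isneginf(zbuf)
    return right, hole_mask

# One-row toy frame: columns 1 and 2 are foreground (disparity 1).
left = np.array([[[10.0], [20.0], [30.0], [40.0], [50.0]]])
disparity = np.array([[0.0, 1.0, 1.0, 0.0, 0.0]])
right, hole_mask = forward_warp(left, disparity)
print(right[0, :, 0])  # [20. 30.  0. 40. 50.]
print(hole_mask[0])    # [False False  True False False]
```

In the toy example the foreground pixels move one column to the left, covering the background at column 0 and exposing column 2, which no source pixel reaches. That exposed column is the kind of region the occlusion mask marks for the stereo video inpainting step.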