{"title":"利用时间亲和和扩散先验重构高质量原始视频。","authors":"Wencheng Han,Jianbing Shen,David J Crandall,Cheng-Zhong Xu","doi":"10.1109/tpami.2025.3596623","DOIUrl":null,"url":null,"abstract":"Due to the rich information and original data distribution, RAW data are widely used in many computer vision applications. However, the use of RAW video remains limited because of the high storage costs associated with data collection. Previous works have attempted to reconstruct RAW frames from sRGB data using small sampled metadata from the original RAW frames. Yet, these algorithms struggle with RAW video reconstruction due to the high computational cost of sampling metadata on cameras. To address these issues, we propose a new RAW video reconstruction pipeline that de-renders high-quality RAW videos from sRGB data using only one initial RAW frame as a reference. Specifically, we introduce three new models to achieve this goal. First, we present the Temporal-Affinity Guided De-rendering Network. This network leverages the temporal affinity between adjacent frames to construct a reference RAW image from previous RAW pixels. The corresponding RAW pixels in the previous frame provide valuable information about the original RAW data distribution, aiding in the precise reconstruction of the current frame. Second, to recover the missing RAW pixels caused by camera and foreground movement, we fully exploit the rich prior information from a pre-trained diffusion model and propose the RAW In-painting Model. This model can accurately fill in hollow regions in a RAW image based on the corresponding sRGB image and the surrounding RAW context. Lastly, we present a lightweight content-aware video clipper that automatically adjusts the clip length used for RAW video reconstruction, thereby balancing storage requirements with reconstruction quality. To better evaluate the performance of the proposed framework across different devices, we introduce the first RAW video reconstruction benchmark that comprises RAW videos from six types of camera devices with challenging scenarios. Experimental results demonstrate that our algorithm can accurately reconstruct RAW videos across all the scenarios. To facilitate further research, the code, pre-trained weight, dataset, and demo web will be publicly available at: https://um-lab.github.io/VideoRAW/.","PeriodicalId":13426,"journal":{"name":"IEEE Transactions on Pattern Analysis and Machine Intelligence","volume":"32 1","pages":""},"PeriodicalIF":18.6000,"publicationDate":"2025-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Reconstructing High Quality Raw Video Using Temporal Affinity and Diffusion Prior.\",\"authors\":\"Wencheng Han,Jianbing Shen,David J Crandall,Cheng-Zhong Xu\",\"doi\":\"10.1109/tpami.2025.3596623\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Due to the rich information and original data distribution, RAW data are widely used in many computer vision applications. However, the use of RAW video remains limited because of the high storage costs associated with data collection. Previous works have attempted to reconstruct RAW frames from sRGB data using small sampled metadata from the original RAW frames. Yet, these algorithms struggle with RAW video reconstruction due to the high computational cost of sampling metadata on cameras. 
To address these issues, we propose a new RAW video reconstruction pipeline that de-renders high-quality RAW videos from sRGB data using only one initial RAW frame as a reference. Specifically, we introduce three new models to achieve this goal. First, we present the Temporal-Affinity Guided De-rendering Network. This network leverages the temporal affinity between adjacent frames to construct a reference RAW image from previous RAW pixels. The corresponding RAW pixels in the previous frame provide valuable information about the original RAW data distribution, aiding in the precise reconstruction of the current frame. Second, to recover the missing RAW pixels caused by camera and foreground movement, we fully exploit the rich prior information from a pre-trained diffusion model and propose the RAW In-painting Model. This model can accurately fill in hollow regions in a RAW image based on the corresponding sRGB image and the surrounding RAW context. Lastly, we present a lightweight content-aware video clipper that automatically adjusts the clip length used for RAW video reconstruction, thereby balancing storage requirements with reconstruction quality. To better evaluate the performance of the proposed framework across different devices, we introduce the first RAW video reconstruction benchmark that comprises RAW videos from six types of camera devices with challenging scenarios. Experimental results demonstrate that our algorithm can accurately reconstruct RAW videos across all the scenarios. To facilitate further research, the code, pre-trained weight, dataset, and demo web will be publicly available at: https://um-lab.github.io/VideoRAW/.\",\"PeriodicalId\":13426,\"journal\":{\"name\":\"IEEE Transactions on Pattern Analysis and Machine Intelligence\",\"volume\":\"32 1\",\"pages\":\"\"},\"PeriodicalIF\":18.6000,\"publicationDate\":\"2025-08-07\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Pattern Analysis and Machine Intelligence\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://doi.org/10.1109/tpami.2025.3596623\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Pattern Analysis and Machine Intelligence","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1109/tpami.2025.3596623","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Reconstructing High Quality Raw Video Using Temporal Affinity and Diffusion Prior.
Abstract: Because RAW data preserve rich information and the sensor's original data distribution, they are widely used in many computer vision applications. However, the use of RAW video remains limited by the high storage costs of data collection. Previous work has attempted to reconstruct RAW frames from sRGB data using a small amount of metadata sampled from the original RAW frames, but these algorithms struggle with RAW video reconstruction because sampling metadata on-camera is computationally expensive. To address these issues, we propose a new RAW video reconstruction pipeline that de-renders high-quality RAW videos from sRGB data using only one initial RAW frame as a reference. Specifically, we introduce three new models to achieve this goal. First, we present the Temporal-Affinity Guided De-rendering Network, which leverages the temporal affinity between adjacent frames to construct a reference RAW image from pixels of the previous RAW frame; these corresponding pixels carry valuable information about the original RAW data distribution and aid the precise reconstruction of the current frame. Second, to recover RAW pixels lost to camera and foreground motion, we fully exploit the rich prior of a pre-trained diffusion model and propose the RAW In-painting Model, which accurately fills hollow regions in a RAW image based on the corresponding sRGB image and the surrounding RAW context. Lastly, we present a lightweight, content-aware video clipper that automatically adjusts the clip length used for RAW video reconstruction, balancing storage requirements against reconstruction quality. To better evaluate the performance of the proposed framework across different devices, we introduce the first RAW video reconstruction benchmark, comprising RAW videos from six types of camera devices in challenging scenarios. Experimental results demonstrate that our algorithm accurately reconstructs RAW videos across all scenarios. To facilitate further research, the code, pre-trained weights, dataset, and demo website will be publicly available at: https://um-lab.github.io/VideoRAW/.
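To make the pipeline concrete, below is a minimal Python sketch of the per-clip reconstruction loop the abstract describes. It is an illustration under stated assumptions, not the paper's implementation: a simple nearest-pixel forward warp stands in for the learned temporal-affinity module, and estimate_flow, derender, and inpaint are hypothetical placeholders for the learned components (the affinity estimator, the de-rendering network, and the diffusion-prior in-painting model). The toy clipper at the end only mimics the idea of trading extra RAW keyframes for quality.

# Hypothetical sketch of the described pipeline; not the authors' code.
import numpy as np

def warp_previous_raw(prev_raw, flow):
    """Forward-warp the previous RAW frame to the current frame.

    Returns the warped reference RAW image and a validity mask. Pixels
    that receive no source pixel are the "hollow regions" left by camera
    and foreground motion, to be filled by the in-painting model.
    """
    h, w = prev_raw.shape
    ref = np.zeros_like(prev_raw)
    valid = np.zeros((h, w), dtype=bool)
    ys, xs = np.mgrid[0:h, 0:w]
    # Destination coordinates after motion, rounded to the nearest pixel.
    xd = np.clip(np.round(xs + flow[..., 0]).astype(int), 0, w - 1)
    yd = np.clip(np.round(ys + flow[..., 1]).astype(int), 0, h - 1)
    ref[yd, xd] = prev_raw
    valid[yd, xd] = True
    return ref, valid

def reconstruct_clip(first_raw, srgb_frames, estimate_flow, derender, inpaint):
    """De-render one clip of sRGB frames back to RAW from a single keyframe.

    estimate_flow, derender, and inpaint stand in for the learned
    temporal-affinity, de-rendering, and RAW in-painting models.
    """
    raws = [first_raw]
    for t in range(1, len(srgb_frames)):
        flow = estimate_flow(srgb_frames[t - 1], srgb_frames[t])
        ref, valid = warp_previous_raw(raws[-1], flow)
        # Fill motion-induced holes from the sRGB frame and RAW context.
        ref = inpaint(ref, valid, srgb_frames[t])
        # Reconstruct the current RAW frame guided by the reference.
        raws.append(derender(srgb_frames[t], ref))
    return raws

def choose_clip_length(srgb_frames, estimate_flow, max_hole_ratio=0.2,
                       max_len=64):
    """Toy stand-in for the content-aware clipper: cut the clip (store a
    new RAW keyframe) once too many pixels would become holes, trading
    extra storage for reconstruction quality."""
    h, w = srgb_frames[0].shape[:2]
    valid = np.ones((h, w), dtype=bool)
    for t in range(1, min(max_len, len(srgb_frames))):
        flow = estimate_flow(srgb_frames[t - 1], srgb_frames[t])
        _, hit = warp_previous_raw(valid.astype(float), flow)
        valid &= hit
        if 1.0 - valid.mean() > max_hole_ratio:
            return t
    return min(max_len, len(srgb_frames))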
Journal Introduction:
The IEEE Transactions on Pattern Analysis and Machine Intelligence publishes articles on all traditional areas of computer vision and image understanding, all traditional areas of pattern analysis and recognition, and selected areas of machine intelligence, with a particular emphasis on machine learning for pattern analysis. Areas such as techniques for visual search, document and handwriting analysis, medical image analysis, video and image sequence analysis, content-based retrieval of image and video, face and gesture recognition and relevant specialized hardware and/or software architectures are also covered.