RViDeformer：高效的原始视频去噪变压器与更大的基准数据集

IF 11.1 1区工程技术 Q1 ENGINEERING, ELECTRICAL & ELECTRONIC

IEEE Transactions on Circuits and Systems for Video Technology Pub Date : 2025-03-20 DOI:10.1109/TCSVT.2025.3553160

Huanjing Yue;Cong Cao;Lei Liao;Jingyu Yang

{"title":"RViDeformer：高效的原始视频去噪变压器与更大的基准数据集","authors":"Huanjing Yue;Cong Cao;Lei Liao;Jingyu Yang","doi":"10.1109/TCSVT.2025.3553160","DOIUrl":null,"url":null,"abstract":"In recent years, raw video denoising has garnered increased attention due to the consistency with the imaging process and well-studied noise modeling in the raw domain. However, two problems still hinder the denoising performance. Firstly, there is no large dataset with realistic motions for supervised raw video denoising, as capturing noisy and clean frames for real dynamic scenes is difficult. To address this, we propose recapturing existing high-resolution videos displayed on a 4K screen with high-low ISO settings to construct noisy-clean paired frames. In this way, we construct a video denoising dataset (named as ReCRVD) with 120 groups of noisy-clean videos, whose ISO values ranging from 1600 to 25600. Secondly, while non-local temporal-spatial attention is beneficial for denoising, it often leads to heavy computation costs. We propose an efficient raw video denoising transformer network (RViDeformer) that explores both short and long-distance correlations. Specifically, we propose multi-branch spatial and temporal attention modules, which explore the patch correlations from local window, local low-resolution window, global downsampled window, and neighbor-involved window, and then they are fused together. We employ reparameterization to reduce computation costs. Our network is trained in both supervised and unsupervised manners, achieving the best performance compared with state-of-the-art methods. Additionally, the model trained with our proposed dataset (ReCRVD) outperforms the model trained with previous benchmark dataset (CRVD) when evaluated on the real-world outdoor noisy videos. <italic>Our code and dataset will be released.</i>","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 9","pages":"8929-8944"},"PeriodicalIF":11.1000,"publicationDate":"2025-03-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"RViDeformer: Efficient Raw Video Denoising Transformer With a Larger Benchmark Dataset\",\"authors\":\"Huanjing Yue;Cong Cao;Lei Liao;Jingyu Yang\",\"doi\":\"10.1109/TCSVT.2025.3553160\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In recent years, raw video denoising has garnered increased attention due to the consistency with the imaging process and well-studied noise modeling in the raw domain. However, two problems still hinder the denoising performance. Firstly, there is no large dataset with realistic motions for supervised raw video denoising, as capturing noisy and clean frames for real dynamic scenes is difficult. To address this, we propose recapturing existing high-resolution videos displayed on a 4K screen with high-low ISO settings to construct noisy-clean paired frames. In this way, we construct a video denoising dataset (named as ReCRVD) with 120 groups of noisy-clean videos, whose ISO values ranging from 1600 to 25600. Secondly, while non-local temporal-spatial attention is beneficial for denoising, it often leads to heavy computation costs. We propose an efficient raw video denoising transformer network (RViDeformer) that explores both short and long-distance correlations. Specifically, we propose multi-branch spatial and temporal attention modules, which explore the patch correlations from local window, local low-resolution window, global downsampled window, and neighbor-involved window, and then they are fused together. We employ reparameterization to reduce computation costs. Our network is trained in both supervised and unsupervised manners, achieving the best performance compared with state-of-the-art methods. Additionally, the model trained with our proposed dataset (ReCRVD) outperforms the model trained with previous benchmark dataset (CRVD) when evaluated on the real-world outdoor noisy videos. <italic>Our code and dataset will be released.</i>\",\"PeriodicalId\":13082,\"journal\":{\"name\":\"IEEE Transactions on Circuits and Systems for Video Technology\",\"volume\":\"35 9\",\"pages\":\"8929-8944\"},\"PeriodicalIF\":11.1000,\"publicationDate\":\"2025-03-20\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Circuits and Systems for Video Technology\",\"FirstCategoryId\":\"5\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10935691/\",\"RegionNum\":1,\"RegionCategory\":\"工程技术\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"ENGINEERING, ELECTRICAL & ELECTRONIC\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Circuits and Systems for Video Technology","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/10935691/","RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}

引用次数: 0

摘要

近年来，原始视频去噪因其与成像过程的一致性以及对原始域噪声建模的深入研究而受到越来越多的关注。然而，两个问题仍然阻碍了去噪性能。首先，由于很难捕获真实动态场景的噪声和干净帧，因此没有具有真实运动的大型数据集用于监督原始视频去噪。为了解决这个问题，我们建议重新捕获现有的高分辨率视频，在4K屏幕上显示高-低ISO设置，以构建噪音清洁的配对帧。通过这种方式，我们构建了一个包含120组去噪视频的视频去噪数据集（命名为recvd），这些去噪视频的ISO值从1600到25600不等。其次，非局部时空注意虽然有利于去噪，但往往会带来沉重的计算成本。我们提出了一种高效的原始视频去噪变压器网络（RViDeformer），它可以探索短距离和长距离的相关性。具体而言，我们提出了多分支时空关注模块，该模块从局部窗口、局部低分辨率窗口、全局下采样窗口和邻居相关窗口探索斑块相关性，然后将它们融合在一起。我们采用重参数化来减少计算成本。我们的网络以监督和非监督两种方式进行训练，与最先进的方法相比，实现了最佳性能。此外，当在真实的室外噪声视频上进行评估时，用我们提出的数据集（recvd）训练的模型优于用以前的基准数据集（CRVD）训练的模型。我们的代码和数据集将被发布。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

RViDeformer: Efficient Raw Video Denoising Transformer With a Larger Benchmark Dataset

In recent years, raw video denoising has garnered increased attention due to the consistency with the imaging process and well-studied noise modeling in the raw domain. However, two problems still hinder the denoising performance. Firstly, there is no large dataset with realistic motions for supervised raw video denoising, as capturing noisy and clean frames for real dynamic scenes is difficult. To address this, we propose recapturing existing high-resolution videos displayed on a 4K screen with high-low ISO settings to construct noisy-clean paired frames. In this way, we construct a video denoising dataset (named as ReCRVD) with 120 groups of noisy-clean videos, whose ISO values ranging from 1600 to 25600. Secondly, while non-local temporal-spatial attention is beneficial for denoising, it often leads to heavy computation costs. We propose an efficient raw video denoising transformer network (RViDeformer) that explores both short and long-distance correlations. Specifically, we propose multi-branch spatial and temporal attention modules, which explore the patch correlations from local window, local low-resolution window, global downsampled window, and neighbor-involved window, and then they are fused together. We employ reparameterization to reduce computation costs. Our network is trained in both supervised and unsupervised manners, achieving the best performance compared with state-of-the-art methods. Additionally, the model trained with our proposed dataset (ReCRVD) outperforms the model trained with previous benchmark dataset (CRVD) when evaluated on the real-world outdoor noisy videos. Our code and dataset will be released.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

IEEE Transactions on Circuits and Systems for Video Technology 工程技术-工程：电子与电气

CiteScore

13.80

自引率

27.40%

发文量

660

审稿时长

5 months

期刊介绍： The IEEE Transactions on Circuits and Systems for Video Technology (TCSVT) is dedicated to covering all aspects of video technologies from a circuits and systems perspective. We encourage submissions of general, theoretical, and application-oriented papers related to image and video acquisition, representation, presentation, and display. Additionally, we welcome contributions in areas such as processing, filtering, and transforms; analysis and synthesis; learning and understanding; compression, transmission, communication, and networking; as well as storage, retrieval, indexing, and search. Furthermore, papers focusing on hardware and software design and implementation are highly valued. Join us in advancing the field of video technology through innovative research and insights.