Raformer: Redundancy-Aware Transformer for Video Wire Inpainting

IEEE Transactions on Image Processing: a publication of the IEEE Signal Processing Society. Pub Date: 2025-03-17. DOI: 10.1109/TIP.2025.3550038

Zhong Ji;Yimu Su;Yan Zhang;Jiacheng Hou;Yanwei Pang;Jungong Han

Volume 34, pages 1795-1809. Publication type: Journal Article. Source: https://ieeexplore.ieee.org/document/10930654/

Citations: 0

Abstract

Video Wire Inpainting (VWI) is a prominent application of video inpainting that aims to flawlessly remove wires from films and TV series, offering significant time and labor savings compared to manual frame-by-frame removal. However, wire removal is more challenging than general video inpainting: wires are longer and slimmer than the objects typically targeted, and they often intersect people and background objects irregularly, which complicates the inpainting process. Recognizing that existing video wire datasets are limited by their small size, poor quality, and narrow variety of scenes, we introduce a new VWI dataset, Wire Removal Video Dataset 2 (WRV2), together with a novel mask generation strategy, Pseudo Wire-Shaped (PWS) Masks. The WRV2 dataset comprises over 4,000 videos with an average length of 80 frames and is designed to facilitate the development and improve the efficacy of inpainting models. Building upon this, we propose the Redundancy-Aware Transformer (Raformer), a method that addresses the unique challenges of wire removal in video inpainting. Unlike conventional approaches that indiscriminately process all frame patches, Raformer selectively bypasses redundant parts, such as static background segments that carry no valuable information for inpainting. At the core of Raformer is the Redundancy-Aware Attention (RAA) module, which isolates and accentuates essential content through a coarse-grained, window-based attention mechanism. This is complemented by a Soft Feature Alignment (SFA) module, which refines these features and achieves end-to-end feature alignment. Extensive experiments on both traditional video inpainting datasets and our proposed WRV2 dataset demonstrate that Raformer outperforms other state-of-the-art methods. Our code and the WRV2 dataset will be made available at: https://github.com/Suyimu/WRV2.
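
The abstract does not say how redundant patches are identified, so the following is only a minimal sketch of the general idea of skipping static windows before attention. It assumes a mean absolute frame-difference score as the redundancy criterion; that score, the function name `select_windows`, and the parameters `window` and `keep_ratio` are all hypothetical and are not taken from the paper or its repository.

```python
import numpy as np


def select_windows(frames, mask, window=8, keep_ratio=0.5):
    """Pick the windows worth attending to (illustrative assumption only).

    frames: (T, H, W) float array of grayscale frames, T >= 2.
    mask:   (H, W) bool array, True where wire pixels must be inpainted.
    Returns a (H // window, W // window) bool grid; True means "keep".
    """
    t, h, w = frames.shape
    if t < 2:
        raise ValueError("need at least two frames to measure motion")
    gh, gw = h // window, w // window

    # Crop so the frame divides evenly into windows.
    frames = frames[:, : gh * window, : gw * window]
    mask = mask[: gh * window, : gw * window]

    # Hypothetical redundancy score: mean absolute temporal difference.
    motion = np.abs(np.diff(frames, axis=0)).mean(axis=0)

    # Pool the per-pixel score and the mask over each window.
    score = motion.reshape(gh, window, gw, window).mean(axis=(1, 3))
    masked = mask.reshape(gh, window, gw, window).any(axis=(1, 3))

    # Keep the k highest-scoring windows (ties at the threshold are kept too).
    k = max(1, int(round(keep_ratio * gh * gw)))
    threshold = np.sort(score.ravel())[-k]
    keep = score >= threshold

    # Windows that contain wire pixels are always kept.
    return keep | masked
```

In a sketch like this, only the tokens from kept windows would be passed to the attention layers, while static background windows are bypassed. Windows that overlap the wire mask are always retained, since those are the regions to be filled.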


Source journal

IEEE transactions on image processing : a publication of the IEEE Signal Processing Society

Self-citation rate

0.00%

Publication count