Video Wire Inpainting via Hierarchical Feature Mixture

IF 4.2 | Zone 3 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Image and Vision Computing Pub Date : 2025-03-03 DOI:10.1016/j.imavis.2025.105460

Zhong Ji, Yimu Su, Yan Zhang, Shuangming Yang, Yanwei Pang

Image and Vision Computing, Volume 157, Article 105460. Journal Article. https://www.sciencedirect.com/science/article/pii/S0262885625000484

Citation count: 0

Abstract

Video wire inpainting aims to automatically remove visible wires from film footage, significantly streamlining post-production workflows. Previous models address redundancy in wire removal by discarding redundant blocks, so that the model focuses on crucial wire details and reconstructs them more accurately. However, once redundancy is removed, the disorganized non-redundant blocks disrupt temporal and spatial coherence, making seamless inpainting challenging. The absence of multi-scale feature fusion further limits these models' ability to handle different wire scales and to blend inpainted regions with complex backgrounds. To address these challenges, we propose a Hierarchical Feature Mixture Network (HFM-Net) that integrates two novel modules: a Hierarchical Transformer Module (HTM) and a Spatio-temporal Feature Mixture Module (SFM). Specifically, the HTM employs redundancy-aware attention modules and lightweight transformers to reorganize and fuse key high- and low-dimensional patches. Lightweight transformers suffice because only the reduced set of non-redundant blocks needs to be processed. By aggregating similar features, these transformers guide the alignment of non-redundant blocks and achieve effective spatio-temporal synchronization. Building on this, the SFM incorporates gated convolutions and a GRU to further enhance spatial and temporal integration. The gated convolutions fuse low- and high-dimensional features, while the GRU captures temporal dependencies, enabling seamless inpainting of dynamic wire patterns. Additionally, we introduce a lightweight 3D separable convolution discriminator to improve video quality during the inpainting process while reducing computational costs. Experimental results demonstrate that HFM-Net achieves state-of-the-art performance on the video wire removal task.
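The abstract does not give implementation details for the SFM, so the following is only a minimal NumPy sketch of the two generic building blocks it names: a gated convolution that fuses two feature maps, and a GRU step that carries information across frames. It is not the authors' code. The function names (`gated_fusion`, `gru_step`, `run_sketch`), the use of 1x1 convolutions, the global average pooling before the GRU, and all shapes and random weights are assumptions made for illustration.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def gated_fusion(low, high, w_feat, w_gate):
    """Fuse two (C, H, W) feature maps with a gated 1x1 convolution.

    Assumption: a 1x1 kernel stands in for the gated convolutions
    mentioned in the abstract; the paper's kernel sizes are not given.
    """
    x = np.concatenate([low, high], axis=0)             # (2C, H, W)
    feat = np.einsum("oc,chw->ohw", w_feat, x)          # feature branch
    gate = sigmoid(np.einsum("oc,chw->ohw", w_gate, x)) # soft mask in (0, 1)
    return np.tanh(feat) * gate                         # gated output


def gru_step(x, h, p):
    """One standard GRU update for input vector x and hidden state h."""
    z = sigmoid(p["Wz"] @ x + p["Uz"] @ h)              # update gate
    r = sigmoid(p["Wr"] @ x + p["Ur"] @ h)              # reset gate
    h_new = np.tanh(p["Wh"] @ x + p["Uh"] @ (r * h))    # candidate state
    return (1.0 - z) * h + z * h_new


def run_sketch(frames=4, c=3, hgt=5, wid=5, hidden=6, seed=0):
    """Run gated fusion per frame, then feed pooled features to the GRU."""
    rng = np.random.default_rng(seed)
    w_feat = 0.1 * rng.normal(size=(c, 2 * c))
    w_gate = 0.1 * rng.normal(size=(c, 2 * c))
    p = {k: 0.1 * rng.normal(size=(hidden, c)) for k in ("Wz", "Wr", "Wh")}
    p.update(
        {k: 0.1 * rng.normal(size=(hidden, hidden)) for k in ("Uz", "Ur", "Uh")}
    )
    h = np.zeros(hidden)
    fused = None
    for _ in range(frames):
        # Random stand-ins for low- and high-dimensional frame features.
        low = rng.normal(size=(c, hgt, wid))
        high = rng.normal(size=(c, hgt, wid))
        fused = gated_fusion(low, high, w_feat, w_gate)
        # Assumption: global average pooling summarizes each frame.
        h = gru_step(fused.mean(axis=(1, 2)), h, p)
    return fused, h


if __name__ == "__main__":
    fused, h = run_sketch()
    print(fused.shape, h.shape)
```

The gate lets the network suppress features at locations where the input is unreliable, such as masked wire pixels, while the GRU hidden state gives each frame access to a summary of earlier frames. A real model would learn the weights and would most likely keep spatial resolution in the recurrent state rather than pooling it away.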




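Source journal

Image and Vision Computing (Engineering and Technology: Engineering, Electrical & Electronic)

CiteScore: 8.50
Self-citation rate: 8.50%
Articles published: 143
Review time: 7.8 months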

Journal description: The primary aim of Image and Vision Computing is to provide an effective medium of interchange for the results of high-quality theoretical and applied research fundamental to all aspects of image interpretation and computer vision. The journal publishes work that proposes new image interpretation and computer vision methodology or addresses the application of such methods to real-world scenes. It seeks to deepen understanding in the discipline by encouraging the quantitative comparison and performance evaluation of the proposed methodology. The coverage includes: image interpretation, scene modelling, object recognition and tracking, shape analysis, monitoring and surveillance, active vision and robotic systems, SLAM, biologically-inspired computer vision, motion analysis, stereo vision, document image understanding, character and handwritten text recognition, face and gesture recognition, biometrics, vision-based human-computer interaction, human activity and behavior understanding, data fusion from multiple sensor inputs, and image databases.