FVIFormer: Flow-Guided Global-Local Aggregation Transformer Network for Video Inpainting

Impact Factor: 3.7 · CAS Region 2 (Engineering & Technology) · JCR Q2, ENGINEERING, ELECTRICAL & ELECTRONIC
Weiqing Yan;Yiqiu Sun;Guanghui Yue;Wei Zhou;Hantao Liu
DOI: 10.1109/JETCAS.2024.3392972
Journal: IEEE Journal on Emerging and Selected Topics in Circuits and Systems
Published: 2024-04-25 (Journal Article)
URL: https://ieeexplore.ieee.org/document/10508737/
Citations: 0 (Semantic Scholar) · Open access: no

Abstract

Video inpainting has been widely used in recent years. Established works usually exploit the similarity between the missing region and its surrounding features to inpaint the visually damaged content in a multi-stage manner. However, owing to the complexity of video content, this can destroy the structural information of objects within the video. Moreover, moving objects in the damaged regions further increase the difficulty of the task. To address these issues, we propose a flow-guided global-local aggregation Transformer network for video inpainting. First, we use a pre-trained optical flow completion network to repair the defective optical flow of video frames. Then, we propose a content inpainting module that, guided by the completed optical flow, propagates global content across video frames using an efficient temporal-spatial Transformer to inpaint the corrupted regions of the video. Finally, we propose a structural rectification module that enhances the coherence of content around the missing regions by combining the extracted local and global features. In addition, considering the efficiency of the overall framework, we also optimise the self-attention mechanism with depth-wise separable encoding to speed up training and testing. We validate the effectiveness of our method on the YouTube-VOS and DAVIS video datasets. Extensive experimental results demonstrate the effectiveness of our approach in edge-complementing video content that has undergone stabilisation algorithms.
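To illustrate the flow-guided propagation idea described in the abstract, the sketch below backward-warps a neighbouring reference frame along its (completed) optical flow and copies only the flow-aligned pixels into the masked region. This is a minimal, hypothetical illustration, not the authors' implementation: the function names, nearest-neighbour sampling, and single-reference-frame setup are all assumptions.

```python
import numpy as np

def warp_with_flow(ref_frame, flow):
    """Backward-warp ref_frame into the current view using optical flow.

    flow[y, x] = (dx, dy) tells each current pixel where to sample in
    ref_frame. Nearest-neighbour sampling keeps the sketch short; a real
    system would use bilinear interpolation.
    """
    h, w = ref_frame.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    src_x = np.clip(np.round(xs + flow[..., 0]).astype(int), 0, w - 1)
    src_y = np.clip(np.round(ys + flow[..., 1]).astype(int), 0, h - 1)
    return ref_frame[src_y, src_x]

def flow_guided_fill(frame, mask, ref_frame, flow):
    """Fill masked (missing) pixels of `frame` with flow-aligned content
    from a neighbouring reference frame; valid pixels stay untouched."""
    warped = warp_with_flow(ref_frame, flow)
    out = frame.copy()
    out[mask] = warped[mask]
    return out
```

In the full method this propagation would run across many frames, with the remaining unfilled pixels handled by the temporal-spatial Transformer; here a constant flow and a single reference frame keep the example minimal.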
Source journal metrics
CiteScore: 8.50
Self-citation rate: 2.20%
Articles published: 86
Journal description: The IEEE Journal on Emerging and Selected Topics in Circuits and Systems is published quarterly and solicits, with particular emphasis on emerging areas, special issues on topics that cover the entire scope of the IEEE Circuits and Systems (CAS) Society, namely the theory, analysis, design, tools, and implementation of circuits and systems, spanning their theoretical foundations, applications, and architectures for signal and information processing.