Global Spectral Filter Memory Network for Video Object Segmentation

Computer vision - ECCV ... : ... European Conference on Computer Vision : proceedings. European Conference on Computer Vision Pub Date : 2022-10-11 DOI:10.48550/arXiv.2210.05567

Yong Liu, R. Yu, Jiahao Wang, Xinyuan Zhao, Yitong Wang, Yansong Tang, Yujiu Yang

{"title":"Global Spectral Filter Memory Network for Video Object Segmentation","authors":"Yong Liu, R. Yu, Jiahao Wang, Xinyuan Zhao, Yitong Wang, Yansong Tang, Yujiu Yang","doi":"10.48550/arXiv.2210.05567","DOIUrl":null,"url":null,"abstract":"This paper studies semi-supervised video object segmentation through boosting intra-frame interaction. Recent memory network-based methods focus on exploiting inter-frame temporal reference while paying little attention to intra-frame spatial dependency. Specifically, these segmentation model tends to be susceptible to interference from unrelated nontarget objects in a certain frame. To this end, we propose Global Spectral Filter Memory network (GSFM), which improves intra-frame interaction through learning long-term spatial dependencies in the spectral domain. The key components of GSFM is 2D (inverse) discrete Fourier transform for spatial information mixing. Besides, we empirically find low frequency feature should be enhanced in encoder (backbone) while high frequency for decoder (segmentation head). We attribute this to semantic information extracting role for encoder and fine-grained details highlighting role for decoder. Thus, Low (High) Frequency Module is proposed to fit this circumstance. Extensive experiments on the popular DAVIS and YouTube-VOS benchmarks demonstrate that GSFM noticeably outperforms the baseline method and achieves state-of-the-art performance. Besides, extensive analysis shows that the proposed modules are reasonable and of great generalization ability. Our source code is available at https://github.com/workforai/GSFM.","PeriodicalId":72676,"journal":{"name":"Computer vision - ECCV ... : ... European Conference on Computer Vision : proceedings. European Conference on Computer Vision","volume":"103 1","pages":"648-665"},"PeriodicalIF":0.0000,"publicationDate":"2022-10-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"17","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computer vision - ECCV ... : ... European Conference on Computer Vision : proceedings. European Conference on Computer Vision","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.48550/arXiv.2210.05567","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 17

Abstract

This paper studies semi-supervised video object segmentation through boosting intra-frame interaction. Recent memory network-based methods focus on exploiting inter-frame temporal reference while paying little attention to intra-frame spatial dependency. Specifically, these segmentation model tends to be susceptible to interference from unrelated nontarget objects in a certain frame. To this end, we propose Global Spectral Filter Memory network (GSFM), which improves intra-frame interaction through learning long-term spatial dependencies in the spectral domain. The key components of GSFM is 2D (inverse) discrete Fourier transform for spatial information mixing. Besides, we empirically find low frequency feature should be enhanced in encoder (backbone) while high frequency for decoder (segmentation head). We attribute this to semantic information extracting role for encoder and fine-grained details highlighting role for decoder. Thus, Low (High) Frequency Module is proposed to fit this circumstance. Extensive experiments on the popular DAVIS and YouTube-VOS benchmarks demonstrate that GSFM noticeably outperforms the baseline method and achieves state-of-the-art performance. Besides, extensive analysis shows that the proposed modules are reasonable and of great generalization ability. Our source code is available at https://github.com/workforai/GSFM.

查看原文本刊更多论文

用于视频目标分割的全局频谱滤波记忆网络

本文通过增强帧内交互来研究半监督视频目标分割。目前基于记忆网络的方法主要关注帧间的时间参考，而很少关注帧内的空间依赖。具体来说，这些分割模型容易受到某一帧内不相关的非目标物体的干扰。为此，我们提出了全局频谱滤波记忆网络(GSFM)，该网络通过学习频谱域的长期空间依赖关系来改善帧内交互。GSFM的关键部分是用于空间信息混合的二维(逆)离散傅里叶变换。此外，我们还通过经验发现，在编码器(主干)中需要增强低频特征，而在解码器(分割头)中需要增强高频特征。我们将其归因于编码器的语义信息提取作用和解码器的细粒度细节突出作用。因此，提出了低(高)频率模块来适应这种情况。在流行的DAVIS和YouTube-VOS基准测试上进行的大量实验表明，GSFM明显优于基线方法，并实现了最先进的性能。此外，广泛的分析表明，所提出的模块是合理的，具有很强的泛化能力。我们的源代码可从https://github.com/workforai/GSFM获得。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

Computer vision - ECCV ... : ... European Conference on Computer Vision : proceedings. European Conference on Computer Vision

自引率

0.00%

发文量