Weakly Supervised Video Object Segmentation via Dual-attention Cross-branch Fusion

Lili Wei, Congyan Lang, Liqian Liang, Songhe Feng, Tao Wang, Shidi Chen
{"title":"Weakly Supervised Video Object Segmentation via Dual-attention Cross-branch Fusion","authors":"Lili Wei, Congyan Lang, Liqian Liang, Songhe Feng, Tao Wang, Shidi Chen","doi":"10.1145/3506716","DOIUrl":null,"url":null,"abstract":"Recently, concerning the challenge of collecting large-scale explicitly annotated videos, weakly supervised video object segmentation (WSVOS) using video tags has attracted much attention. Existing WSVOS approaches follow a general pipeline including two phases, i.e., a pseudo masks generation phase and a refinement phase. To explore the intrinsic property and correlation buried in the video frames, most of them focus on the later phase by introducing optical flow as temporal information to provide more supervision. However, these optical flow-based studies are greatly affected by illumination and distortion and lack consideration of the discriminative capacity of multi-level deep features. In this article, with the goal of capturing more effective temporal information and investigating a temporal information fusion strategy accordingly, we propose a unified WSVOS model by adopting a two-branch architecture with a multi-level cross-branch fusion strategy, named as dual-attention cross-branch fusion network (DACF-Net). Concretely, the two branches of DACF-Net, i.e., a temporal prediction subnetwork (TPN) and a spatial segmentation subnetwork (SSN), are used for extracting temporal information and generating predicted segmentation masks, respectively. To perform the cross-branch fusion between TPN and SSN, we propose a dual-attention fusion module that can be plugged into the SSN flexibly. We also pose a cross-frame coherence loss (CFCL) to achieve smooth segmentation results by exploiting the coherence of masks produced by TPN and SSN. Extensive experiments demonstrate the effectiveness of proposed approach compared with the state-of-the-arts on two challenging datasets, i.e., Davis-2016 and YouTube-Objects.","PeriodicalId":123526,"journal":{"name":"ACM Transactions on Intelligent Systems and Technology (TIST)","volume":"79 2","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-03-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"5","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM Transactions on Intelligent Systems and Technology (TIST)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3506716","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 5

Abstract

Recently, owing to the challenge of collecting large-scale, explicitly annotated videos, weakly supervised video object segmentation (WSVOS) using video tags has attracted much attention. Existing WSVOS approaches follow a general two-phase pipeline: a pseudo-mask generation phase and a refinement phase. To explore the intrinsic properties and correlations buried in video frames, most of them focus on the latter phase, introducing optical flow as temporal information to provide additional supervision. However, these optical-flow-based studies are strongly affected by illumination changes and distortion, and they overlook the discriminative capacity of multi-level deep features. In this article, with the goal of capturing more effective temporal information and investigating a corresponding temporal-information fusion strategy, we propose a unified WSVOS model that adopts a two-branch architecture with a multi-level cross-branch fusion strategy, named the dual-attention cross-branch fusion network (DACF-Net). Concretely, the two branches of DACF-Net, a temporal prediction subnetwork (TPN) and a spatial segmentation subnetwork (SSN), extract temporal information and generate predicted segmentation masks, respectively. To perform cross-branch fusion between the TPN and the SSN, we propose a dual-attention fusion module that can be flexibly plugged into the SSN. We also propose a cross-frame coherence loss (CFCL) that yields smooth segmentation results by exploiting the coherence of the masks produced by the TPN and the SSN. Extensive experiments on two challenging datasets, DAVIS-2016 and YouTube-Objects, demonstrate the effectiveness of the proposed approach compared with the state of the art.
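The abstract names two components, the dual-attention fusion module and the cross-frame coherence loss, without giving implementation details. The PyTorch sketch below illustrates one plausible reading: channel attention plus spatial attention gating temporal features into the spatial branch, and a coherence loss that penalizes disagreement between the two branches' masks across frames. All module names, tensor shapes, and the exact attention and loss formulations are assumptions made for illustration, not the paper's actual design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualAttentionFusion(nn.Module):
    """Hypothetical dual-attention fusion: fuses a temporal feature map
    (from the TPN) into a spatial feature map (from the SSN) via a
    channel-attention gate and a spatial-attention gate. Shapes and the
    attention design are assumptions based only on the abstract."""
    def __init__(self, channels: int):
        super().__init__()
        # Channel attention: squeeze-and-excitation style gating.
        self.channel_fc = nn.Sequential(
            nn.Linear(channels, channels // 4),
            nn.ReLU(inplace=True),
            nn.Linear(channels // 4, channels),
            nn.Sigmoid(),
        )
        # Spatial attention: 1x1 conv over both branches -> one attention map.
        self.spatial_conv = nn.Conv2d(2 * channels, 1, kernel_size=1)

    def forward(self, spatial_feat: torch.Tensor, temporal_feat: torch.Tensor):
        # spatial_feat, temporal_feat: (B, C, H, W)
        b, c, _, _ = temporal_feat.shape
        # Channel weights from globally pooled temporal features.
        chn = self.channel_fc(temporal_feat.mean(dim=(2, 3))).view(b, c, 1, 1)
        # Spatial weights from the concatenation of both branches.
        spa = torch.sigmoid(
            self.spatial_conv(torch.cat([spatial_feat, temporal_feat], dim=1))
        )
        # Reweight the temporal features and fuse them into the spatial branch.
        return spatial_feat + chn * spa * temporal_feat


def cross_frame_coherence_loss(ssn_masks: torch.Tensor, tpn_masks: torch.Tensor):
    """Hypothetical cross-frame coherence loss: encourages the SSN and TPN
    masks to agree, and the SSN masks to vary smoothly over time.
    ssn_masks, tpn_masks: (B, T, H, W) probabilities in [0, 1]."""
    # Per-frame agreement between the two branches.
    agree = F.l1_loss(ssn_masks, tpn_masks)
    # Temporal smoothness between consecutive frames.
    smooth = F.l1_loss(ssn_masks[:, 1:], ssn_masks[:, :-1])
    return agree + smooth
```

Since the abstract describes the fusion as multi-level and the module as one that "can be flexibly plugged into the SSN", a natural use of this sketch is to instantiate one `DualAttentionFusion` per decoder stage of the SSN, each fusing the TPN features at the matching resolution; that multi-level placement is likewise an assumption.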