Weakly Supervised Video Object Segmentation via Dual-attention Cross-branch Fusion

Lili Wei, Congyan Lang, Liqian Liang, Songhe Feng, Tao Wang, Shidi Chen
{"title":"Weakly Supervised Video Object Segmentation via Dual-attention Cross-branch Fusion","authors":"Lili Wei, Congyan Lang, Liqian Liang, Songhe Feng, Tao Wang, Shidi Chen","doi":"10.1145/3506716","DOIUrl":null,"url":null,"abstract":"Recently, concerning the challenge of collecting large-scale explicitly annotated videos, weakly supervised video object segmentation (WSVOS) using video tags has attracted much attention. Existing WSVOS approaches follow a general pipeline including two phases, i.e., a pseudo masks generation phase and a refinement phase. To explore the intrinsic property and correlation buried in the video frames, most of them focus on the later phase by introducing optical flow as temporal information to provide more supervision. However, these optical flow-based studies are greatly affected by illumination and distortion and lack consideration of the discriminative capacity of multi-level deep features. In this article, with the goal of capturing more effective temporal information and investigating a temporal information fusion strategy accordingly, we propose a unified WSVOS model by adopting a two-branch architecture with a multi-level cross-branch fusion strategy, named as dual-attention cross-branch fusion network (DACF-Net). Concretely, the two branches of DACF-Net, i.e., a temporal prediction subnetwork (TPN) and a spatial segmentation subnetwork (SSN), are used for extracting temporal information and generating predicted segmentation masks, respectively. To perform the cross-branch fusion between TPN and SSN, we propose a dual-attention fusion module that can be plugged into the SSN flexibly. We also pose a cross-frame coherence loss (CFCL) to achieve smooth segmentation results by exploiting the coherence of masks produced by TPN and SSN. Extensive experiments demonstrate the effectiveness of proposed approach compared with the state-of-the-arts on two challenging datasets, i.e., Davis-2016 and YouTube-Objects.","PeriodicalId":123526,"journal":{"name":"ACM Transactions on Intelligent Systems and Technology (TIST)","volume":"79 2","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-03-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"5","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM Transactions on Intelligent Systems and Technology (TIST)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3506716","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 5

Abstract

Recently, owing to the challenge of collecting large-scale, explicitly annotated videos, weakly supervised video object segmentation (WSVOS) using video tags has attracted much attention. Existing WSVOS approaches follow a general two-phase pipeline: a pseudo-mask generation phase and a refinement phase. To explore the intrinsic properties and correlations buried in video frames, most of them focus on the latter phase, introducing optical flow as temporal information to provide additional supervision. However, these optical-flow-based studies are strongly affected by illumination changes and distortion, and they overlook the discriminative capacity of multi-level deep features. In this article, with the goal of capturing more effective temporal information and investigating a corresponding temporal-information fusion strategy, we propose a unified WSVOS model that adopts a two-branch architecture with a multi-level cross-branch fusion strategy, named the dual-attention cross-branch fusion network (DACF-Net). Concretely, the two branches of DACF-Net, a temporal prediction subnetwork (TPN) and a spatial segmentation subnetwork (SSN), extract temporal information and generate predicted segmentation masks, respectively. To perform cross-branch fusion between the TPN and the SSN, we propose a dual-attention fusion module that can be flexibly plugged into the SSN. We also propose a cross-frame coherence loss (CFCL) that yields smooth segmentation results by exploiting the coherence of the masks produced by the TPN and the SSN. Extensive experiments on two challenging datasets, DAVIS-2016 and YouTube-Objects, demonstrate the effectiveness of the proposed approach compared with the state of the art.
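The abstract names two components, the dual-attention fusion module and the cross-frame coherence loss, without giving implementation details. The PyTorch sketch below illustrates one plausible reading: channel attention plus spatial attention gating temporal features into the spatial branch, and a coherence loss that penalizes disagreement between the two branches' masks across frames. All module names, tensor shapes, and the exact attention and loss formulations are assumptions made for illustration, not the paper's actual design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualAttentionFusion(nn.Module):
    """Hypothetical dual-attention fusion: fuses a temporal feature map
    (from the TPN) into a spatial feature map (from the SSN) via a
    channel-attention gate and a spatial-attention gate. Shapes and the
    attention design are assumptions based only on the abstract."""
    def __init__(self, channels: int):
        super().__init__()
        # Channel attention: squeeze-and-excitation style gating.
        self.channel_fc = nn.Sequential(
            nn.Linear(channels, channels // 4),
            nn.ReLU(inplace=True),
            nn.Linear(channels // 4, channels),
            nn.Sigmoid(),
        )
        # Spatial attention: 1x1 conv over both branches -> one attention map.
        self.spatial_conv = nn.Conv2d(2 * channels, 1, kernel_size=1)

    def forward(self, spatial_feat: torch.Tensor, temporal_feat: torch.Tensor):
        # spatial_feat, temporal_feat: (B, C, H, W)
        b, c, _, _ = temporal_feat.shape
        # Channel weights from globally pooled temporal features.
        chn = self.channel_fc(temporal_feat.mean(dim=(2, 3))).view(b, c, 1, 1)
        # Spatial weights from the concatenation of both branches.
        spa = torch.sigmoid(
            self.spatial_conv(torch.cat([spatial_feat, temporal_feat], dim=1))
        )
        # Reweight the temporal features and fuse them into the spatial branch.
        return spatial_feat + chn * spa * temporal_feat


def cross_frame_coherence_loss(ssn_masks: torch.Tensor, tpn_masks: torch.Tensor):
    """Hypothetical cross-frame coherence loss: encourages the SSN and TPN
    masks to agree, and the SSN masks to vary smoothly over time.
    ssn_masks, tpn_masks: (B, T, H, W) probabilities in [0, 1]."""
    # Per-frame agreement between the two branches.
    agree = F.l1_loss(ssn_masks, tpn_masks)
    # Temporal smoothness between consecutive frames.
    smooth = F.l1_loss(ssn_masks[:, 1:], ssn_masks[:, :-1])
    return agree + smooth
```

Since the abstract describes the fusion as multi-level and the module as one that "can be flexibly plugged into the SSN", a natural use of this sketch is to instantiate one `DualAttentionFusion` per decoder stage of the SSN, each fusing the TPN features at the matching resolution; that multi-level placement is likewise an assumption.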