Deep Transport Network for Unsupervised Video Object Segmentation

2021 IEEE/CVF International Conference on Computer Vision (ICCV) | Pub Date: 2021-10-01 | DOI: 10.1109/ICCV48922.2021.00866

Kaihua Zhang, Zicheng Zhao, Dong Liu, Qingshan Liu, Bo Liu

Pages: 8761-8770

Citations: 31

Abstract

Popular unsupervised video object segmentation methods fuse the RGB frame and optical flow via a two-stream network. However, they cannot handle the distracting noise in each input modality, which may severely degrade model performance. We propose to establish the correspondence between the input modalities while suppressing the distracting signals via optimal structural matching. Given a video frame, we extract dense local features from the RGB image and the optical flow, and treat them as two complex structured representations. The Wasserstein distance is then employed to compute the globally optimal flows that transport the features in one modality to the other, where the magnitude of each flow measures the extent of the alignment between two local features. To plug the structural matching into a two-stream network for end-to-end training, we factorize the input cost matrix into small spatial blocks and design a differentiable long-short Sinkhorn module consisting of a long-distant Sinkhorn layer and a short-distant Sinkhorn layer. We integrate the module into a dedicated two-stream network and dub our model TransportNet. Our experiments show that aligning motion and appearance yields state-of-the-art results on popular video object segmentation datasets.
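To make the transport step in the abstract concrete, the following is a minimal sketch of standard entropy-regularized Sinkhorn iterations between two small sets of local features. It is not the paper's long-short Sinkhorn module and does not factorize the cost matrix into spatial blocks. The cosine cost, the uniform marginals, the regularization strength `eps`, the iteration count, and the names `cosine_cost`, `sinkhorn`, `rgb_feats`, and `flow_feats` are all illustrative assumptions.

```python
import math


def cosine_cost(xs, ys):
    """Cost matrix C[i][j] = 1 - cosine similarity of xs[i] and ys[j].

    The cosine cost is an assumption made for this sketch.
    """
    def norm(v):
        return math.sqrt(sum(a * a for a in v)) or 1.0

    return [
        [1.0 - sum(a * b for a, b in zip(x, y)) / (norm(x) * norm(y)) for y in ys]
        for x in xs
    ]


def sinkhorn(cost, eps=0.1, n_iters=200):
    """Approximate the optimal transport plan with Sinkhorn iterations.

    Uniform marginals are assumed: every local feature carries equal mass.
    Returns plan[i][j], the mass moved from source feature i to target j.
    """
    n, m = len(cost), len(cost[0])
    a = [1.0 / n] * n  # source marginal (assumed uniform)
    b = [1.0 / m] * m  # target marginal (assumed uniform)
    # Gibbs kernel: a low cost gives a large kernel value.
    K = [[math.exp(-c / eps) for c in row] for row in cost]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]


# Toy stand-ins for local features taken from the RGB and optical-flow streams.
rgb_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
flow_feats = [[0.9, 0.1], [0.1, 0.9], [1.0, 1.0]]

plan = sinkhorn(cosine_cost(rgb_feats, flow_feats))
for row in plan:
    print(["%.3f" % p for p in row])
```

In this toy setup each RGB feature sends most of its mass to the flow feature pointing in nearly the same direction, so the large entries of `plan` mark well-aligned pairs and the small entries mark weakly aligned ones. Every step is a differentiable operation (exponentials, sums, divisions), which is why iterations of this kind can in principle be unrolled inside a network trained end to end.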

