Paying Attention to Video Object Pattern Understanding.

IF 20.8 · Zone 1 (Computer Science) · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

IEEE Transactions on Pattern Analysis and Machine Intelligence Pub Date : 2021-07-01 Epub Date: 2021-06-08 DOI:10.1109/TPAMI.2020.2966453

Wenguan Wang, Jianbing Shen, Xiankai Lu, Steven C H Hoi, Haibin Ling

Volume 43, Issue 7, pp. 2413-2428. Publication type: Journal Article.

Citations: 73

Abstract

This paper presents a systematic study of the role of visual attention in video object pattern understanding. We carefully annotate three popular video segmentation datasets (DAVIS 16, YouTube-Objects, and SegTrack V2) with dynamic eye-tracking data in the unsupervised video object segmentation (UVOS) setting. For the first time, we quantitatively verify the high consistency of visual attention behavior among human observers, and find a strong correlation between human attention and explicit primary object judgments during dynamic, task-driven viewing. These novel observations provide in-depth insight into the underlying rationale behind video object patterns. Inspired by these findings, we decouple UVOS into two sub-tasks: UVOS-driven Dynamic Visual Attention Prediction (DVAP) in the spatiotemporal domain, and Attention-Guided Object Segmentation (AGOS) in the spatial domain. Our UVOS solution has three major advantages: 1) modular training that does not require expensive video segmentation annotations; instead, more affordable dynamic fixation data are used to train the initial video attention module, and existing fixation-segmentation paired static image data are used to train the subsequent segmentation module; 2) comprehensive foreground understanding through multi-source learning; and 3) additional interpretability from the biologically inspired and assessable attention. Experiments on four popular benchmarks show that, even without using expensive video object mask annotations, our model achieves compelling performance compared with state-of-the-art methods and runs at a fast processing speed (10 fps on a single GPU). Our collected eye-tracking data and algorithm implementations have been made publicly available at https://github.com/wenguanwang/AGS.
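The decoupling described in the abstract (a spatiotemporal attention stage followed by a per-frame, attention-guided segmentation stage) can be illustrated with a toy sketch. This is not the authors' model: `predict_attention`, `segment_with_attention`, and `segment_video` are hypothetical names, frame differencing stands in for the learned DVAP network, and a fixed threshold stands in for the learned AGOS network. Only the two-stage structure is taken from the abstract.

```python
import numpy as np


def predict_attention(frames):
    """Toy stand-in for the DVAP stage (assumption, not the paper's network).

    Takes grayscale frames of shape (T, H, W) and returns one attention map
    per frame in [0, 1], using temporal differences as a crude saliency proxy.
    """
    frames = np.asarray(frames, dtype=np.float64)
    if frames.ndim != 3 or frames.shape[0] < 2:
        raise ValueError("expected at least two frames with shape (T, H, W)")
    # Absolute change between consecutive frames: shape (T - 1, H, W).
    diffs = np.abs(np.diff(frames, axis=0))
    # Frame 0 has no predecessor, so it reuses the first difference map.
    diffs = np.concatenate([diffs[:1], diffs], axis=0)
    # Normalise each map by its own peak; all-zero maps stay zero.
    peak = diffs.max(axis=(1, 2), keepdims=True)
    return np.divide(diffs, peak, out=np.zeros_like(diffs), where=peak > 0)


def segment_with_attention(frame, attention, threshold=0.5):
    """Toy stand-in for the AGOS stage (assumption, not the paper's network).

    Operates on a single frame (spatial domain only). A learned module would
    also use the frame's appearance; this toy version only thresholds the map.
    """
    return attention >= threshold


def segment_video(frames, threshold=0.5):
    """Run the two stages in sequence: attention first, then segmentation."""
    attention = predict_attention(frames)
    masks = [
        segment_with_attention(frame, att, threshold)
        for frame, att in zip(frames, attention)
    ]
    return np.stack(masks)
```

Because the two stages only communicate through the attention maps, each stand-in could be swapped for a separately trained module, which is the modularity the abstract emphasises.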


Source journal

IEEE Transactions on Pattern Analysis and Machine Intelligence (Engineering & Technology: Engineering, Electrical & Electronic)

CiteScore

28.40

Self-citation rate

3.00%

Articles published

885

Review time

8.5 months

Journal description: The IEEE Transactions on Pattern Analysis and Machine Intelligence publishes articles on all traditional areas of computer vision and image understanding, all traditional areas of pattern analysis and recognition, and selected areas of machine intelligence, with a particular emphasis on machine learning for pattern analysis. Areas such as techniques for visual search, document and handwriting analysis, medical image analysis, video and image sequence analysis, content-based retrieval of image and video, face and gesture recognition, and relevant specialized hardware and/or software architectures are also covered.