Is an Object-Centric Video Representation Beneficial for Transfer?

Computer vision - ACCV ... : ... Asian Conference on Computer Vision : proceedings. Asian Conference on Computer Vision
Pub Date: 2022-07-20
DOI: 10.48550/arXiv.2207.10075

Chuhan Zhang, Ankush Gupta, Andrew Zisserman


Citations: 13

Abstract

The objective of this work is to learn an object-centric video representation, with the aim of improving transferability to novel tasks, i.e., tasks different from the pre-training task of action classification. To this end, we introduce a new object-centric video recognition model based on a transformer architecture. The model learns a set of object-centric summary vectors for the video, and uses these vectors to fuse the visual and spatio-temporal trajectory 'modalities' of the video clip. We also introduce a novel trajectory contrast loss to further enhance objectness in these summary vectors. With experiments on four datasets -- SomethingSomething-V2, SomethingElse, Action Genome and EpicKitchens -- we show that the object-centric model outperforms prior video representations (both object-agnostic and object-aware), when: (1) classifying actions on unseen objects and unseen environments; (2) low-shot learning of novel classes; (3) linear probe to other downstream tasks; as well as (4) for standard action classification.

