{"title":"Unsupervised Learning of Spatio-Temporal Representation with Multi-Task Learning for Video Retrieval","authors":"Vidit Kumar","doi":"10.1109/NCC55593.2022.9806811","DOIUrl":null,"url":null,"abstract":"The majority of videos in the internet lack semantic tags, which complicates indexing and retrieval, and mandates the adoption of critical content-based analysis approaches. Earlier works relies on hand-crafted features, which hardly represents the temporal dynamics. Later, video representations learned through supervised deep learning methods were found to be effective, but at the cost of large labeled dataset. Recently, self-supervised based methods for video representation learning are proposed within the community to harness the freely available unlabeled videos. However, most of these methods are based on single pretext task, which limits the learning of generalizable representations. This work proposes to leverage multiple pretext tasks to enhance video representation learning and generalizability. We jointly optimized the C3D network by using multiple pretext tasks such as: rotation prediction, speed prediction, time direction prediction and instance discrimination. The nearest neighbour task is used to analyze the learned features. And for action recognition task, the network is further fine-tuned with pretrained weights. We use the UCF-101 dataset for the experiments and, achieves 28.45% retrieval accuracy (Recall@l), and 68.85% fine-tuned action recognition accuracy, which is better than state-of-the-arts.","PeriodicalId":403870,"journal":{"name":"2022 National Conference on Communications (NCC)","volume":"32 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-05-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"4","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 National Conference on Communications (NCC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/NCC55593.2022.9806811","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 4
Abstract
The majority of videos on the internet lack semantic tags, which complicates indexing and retrieval and makes content-based analysis approaches critical. Earlier works relied on hand-crafted features, which hardly capture temporal dynamics. Later, video representations learned through supervised deep learning were found to be effective, but at the cost of requiring large labeled datasets. Recently, self-supervised methods for video representation learning have been proposed within the community to harness freely available unlabeled videos. However, most of these methods rely on a single pretext task, which limits the learning of generalizable representations. This work proposes to leverage multiple pretext tasks to enhance video representation learning and generalizability. We jointly optimize the C3D network using multiple pretext tasks: rotation prediction, speed prediction, time-direction prediction, and instance discrimination. A nearest-neighbour retrieval task is used to analyze the learned features, and for the action recognition task the network is further fine-tuned from the pretrained weights. We use the UCF-101 dataset for the experiments and achieve 28.45% retrieval accuracy (Recall@1) and 68.85% fine-tuned action recognition accuracy, which is better than the state of the art.
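To make the multi-task pretext setup concrete, below is a minimal PyTorch sketch of the idea described in the abstract: a shared 3D-convolutional backbone (a lightweight stand-in for C3D) feeding four classification heads for rotation, playback speed, time direction, and instance discrimination, trained with a summed cross-entropy loss. All module names, head sizes, and loss weights are illustrative assumptions and not the authors' exact implementation.

```python
# Illustrative multi-task pretext training sketch (assumed details, not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class Small3DBackbone(nn.Module):
    """Tiny C3D-style feature extractor (stand-in for the full C3D network)."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Conv3d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
        )
        self.fc = nn.Linear(64, feat_dim)

    def forward(self, x):                     # x: (B, 3, T, H, W)
        return self.fc(self.features(x).flatten(1))   # (B, feat_dim)


class MultiTaskPretext(nn.Module):
    """Shared backbone with one head per pretext task."""
    def __init__(self, feat_dim=128, num_instances=1000):
        super().__init__()
        self.backbone = Small3DBackbone(feat_dim)
        self.rotation_head = nn.Linear(feat_dim, 4)    # 0/90/180/270 degrees
        self.speed_head = nn.Linear(feat_dim, 3)       # e.g. 1x/2x/4x (assumed classes)
        self.direction_head = nn.Linear(feat_dim, 2)   # forward / backward playback
        self.instance_head = nn.Linear(feat_dim, num_instances)  # instance discrimination

    def forward(self, clip):
        f = self.backbone(clip)
        return {
            "feat": f,                                 # used later for nearest-neighbour retrieval
            "rot": self.rotation_head(f),
            "speed": self.speed_head(f),
            "dir": self.direction_head(f),
            "inst": self.instance_head(f),
        }


def multitask_loss(out, rot_y, speed_y, dir_y, inst_y, weights=(1.0, 1.0, 1.0, 1.0)):
    """Weighted sum of per-task cross-entropy losses (equal weights assumed)."""
    return (weights[0] * F.cross_entropy(out["rot"], rot_y)
            + weights[1] * F.cross_entropy(out["speed"], speed_y)
            + weights[2] * F.cross_entropy(out["dir"], dir_y)
            + weights[3] * F.cross_entropy(out["inst"], inst_y))


if __name__ == "__main__":
    model = MultiTaskPretext()
    clip = torch.randn(2, 3, 16, 112, 112)             # two 16-frame RGB clips
    out = model(clip)
    loss = multitask_loss(out,
                          rot_y=torch.tensor([0, 2]),
                          speed_y=torch.tensor([1, 0]),
                          dir_y=torch.tensor([0, 1]),
                          inst_y=torch.tensor([10, 42]))
    loss.backward()
    print(loss.item())
```

After such pretext training, the backbone features ("feat" above) would be the ones used for nearest-neighbour video retrieval (e.g. Recall@1), while fine-tuning the backbone with a classification head corresponds to the action recognition evaluation on UCF-101.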