Unsupervised Learning of Spatio-Temporal Representation with Multi-Task Learning for Video Retrieval

Vidit Kumar
{"title":"Unsupervised Learning of Spatio-Temporal Representation with Multi-Task Learning for Video Retrieval","authors":"Vidit Kumar","doi":"10.1109/NCC55593.2022.9806811","DOIUrl":null,"url":null,"abstract":"The majority of videos in the internet lack semantic tags, which complicates indexing and retrieval, and mandates the adoption of critical content-based analysis approaches. Earlier works relies on hand-crafted features, which hardly represents the temporal dynamics. Later, video representations learned through supervised deep learning methods were found to be effective, but at the cost of large labeled dataset. Recently, self-supervised based methods for video representation learning are proposed within the community to harness the freely available unlabeled videos. However, most of these methods are based on single pretext task, which limits the learning of generalizable representations. This work proposes to leverage multiple pretext tasks to enhance video representation learning and generalizability. We jointly optimized the C3D network by using multiple pretext tasks such as: rotation prediction, speed prediction, time direction prediction and instance discrimination. The nearest neighbour task is used to analyze the learned features. And for action recognition task, the network is further fine-tuned with pretrained weights. We use the UCF-101 dataset for the experiments and, achieves 28.45% retrieval accuracy (Recall@l), and 68.85% fine-tuned action recognition accuracy, which is better than state-of-the-arts.","PeriodicalId":403870,"journal":{"name":"2022 National Conference on Communications (NCC)","volume":"32 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-05-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"4","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 National Conference on Communications (NCC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/NCC55593.2022.9806811","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 4

Abstract

The majority of videos on the internet lack semantic tags, which complicates indexing and retrieval and makes content-based analysis approaches critical. Earlier works relied on hand-crafted features, which hardly capture temporal dynamics. Later, video representations learned through supervised deep learning methods were found to be effective, but at the cost of large labeled datasets. Recently, self-supervised methods for video representation learning have been proposed within the community to harness freely available unlabeled videos. However, most of these methods are based on a single pretext task, which limits the learning of generalizable representations. This work proposes to leverage multiple pretext tasks to enhance video representation learning and generalizability. We jointly optimize the C3D network using multiple pretext tasks: rotation prediction, speed prediction, time-direction prediction, and instance discrimination. A nearest-neighbour retrieval task is used to analyze the learned features, and for the action recognition task the network is further fine-tuned from the pretrained weights. We use the UCF-101 dataset for the experiments and achieve 28.45% retrieval accuracy (Recall@1) and 68.85% fine-tuned action recognition accuracy, which is better than the state of the art.
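The joint multi-task optimization described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' released code: the tiny backbone stands in for C3D, the head names and equal loss weights are assumptions, and the memory-bank softmax is one common formulation of instance discrimination; the paper's exact contrastive variant may differ. The `recall_at_1` helper mirrors the nearest-neighbour evaluation protocol.

```python
# A minimal sketch (PyTorch) of joint pretext-task training, assuming a
# C3D-style backbone. All names and hyperparameters are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskPretextModel(nn.Module):
    def __init__(self, feature_dim=512, num_speeds=4, embed_dim=128):
        super().__init__()
        # Placeholder backbone; the paper uses C3D. Any 3D-conv network
        # mapping a clip (B, C, T, H, W) to a feature vector works here.
        self.backbone = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
            nn.Flatten(),
            nn.Linear(64, feature_dim),
        )
        self.rotation_head = nn.Linear(feature_dim, 4)        # 0/90/180/270 degrees
        self.speed_head = nn.Linear(feature_dim, num_speeds)  # playback speeds
        self.direction_head = nn.Linear(feature_dim, 2)       # forward vs. reversed
        self.embed_head = nn.Linear(feature_dim, embed_dim)   # instance discrimination

    def forward(self, clips):
        f = self.backbone(clips)
        return (self.rotation_head(f), self.speed_head(f),
                self.direction_head(f), F.normalize(self.embed_head(f), dim=1))

def multitask_loss(outputs, rot_y, speed_y, dir_y, memory_bank, idx, tau=0.07):
    rot_logits, speed_logits, dir_logits, z = outputs
    # Classification losses for the three transformation-prediction tasks.
    l_rot = F.cross_entropy(rot_logits, rot_y)
    l_speed = F.cross_entropy(speed_logits, speed_y)
    l_dir = F.cross_entropy(dir_logits, dir_y)
    # Instance discrimination as a non-parametric softmax over a memory
    # bank of per-video embeddings (an assumption; the paper's exact
    # contrastive formulation may differ).
    logits = z @ memory_bank.t() / tau
    l_inst = F.cross_entropy(logits, idx)
    # Equal weighting is the simplest choice; per-task weights could be tuned.
    return l_rot + l_speed + l_dir + l_inst

def recall_at_1(query_z, gallery_z, query_labels, gallery_labels):
    # Nearest-neighbour retrieval: a query counts as correct if its closest
    # gallery clip (by cosine similarity) shares its class label.
    sims = F.normalize(query_z, dim=1) @ F.normalize(gallery_z, dim=1).t()
    nn_idx = sims.argmax(dim=1)
    return (gallery_labels[nn_idx] == query_labels).float().mean().item()

# Illustrative usage: the data loader is assumed to yield clips that were
# randomly rotated, re-sampled at a random speed, and possibly reversed,
# together with the labels of the transformations that were applied.
model = MultiTaskPretextModel()
clips = torch.randn(8, 3, 16, 112, 112)           # (B, C, T, H, W)
rot_y = torch.randint(0, 4, (8,))
speed_y = torch.randint(0, 4, (8,))
dir_y = torch.randint(0, 2, (8,))
bank = F.normalize(torch.randn(100, 128), dim=1)  # toy memory bank
idx = torch.randint(0, 100, (8,))                 # per-video instance indices
loss = multitask_loss(model(clips), rot_y, speed_y, dir_y, bank, idx)
```

In practice each pretext head would typically see its own transformed view of the clip (e.g., a rotated copy for the rotation head) rather than the single shared input of this condensed sketch.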