{"title":"Unsupervised Learning of Spatio-Temporal Representation with Multi-Task Learning for Video Retrieval","authors":"Vidit Kumar","doi":"10.1109/NCC55593.2022.9806811","DOIUrl":null,"url":null,"abstract":"The majority of videos in the internet lack semantic tags, which complicates indexing and retrieval, and mandates the adoption of critical content-based analysis approaches. Earlier works relies on hand-crafted features, which hardly represents the temporal dynamics. Later, video representations learned through supervised deep learning methods were found to be effective, but at the cost of large labeled dataset. Recently, self-supervised based methods for video representation learning are proposed within the community to harness the freely available unlabeled videos. However, most of these methods are based on single pretext task, which limits the learning of generalizable representations. This work proposes to leverage multiple pretext tasks to enhance video representation learning and generalizability. We jointly optimized the C3D network by using multiple pretext tasks such as: rotation prediction, speed prediction, time direction prediction and instance discrimination. The nearest neighbour task is used to analyze the learned features. And for action recognition task, the network is further fine-tuned with pretrained weights. We use the UCF-101 dataset for the experiments and, achieves 28.45% retrieval accuracy (Recall@l), and 68.85% fine-tuned action recognition accuracy, which is better than state-of-the-arts.","PeriodicalId":403870,"journal":{"name":"2022 National Conference on Communications (NCC)","volume":"32 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-05-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"4","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 National Conference on Communications (NCC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/NCC55593.2022.9806811","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 4
Abstract
The majority of videos on the internet lack semantic tags, which complicates indexing and retrieval and makes content-based analysis approaches critical. Earlier works relied on hand-crafted features, which hardly capture temporal dynamics. Later, video representations learned through supervised deep learning were found to be effective, but at the cost of requiring large labeled datasets. Recently, self-supervised methods for video representation learning have been proposed within the community to harness freely available unlabeled videos. However, most of these methods rely on a single pretext task, which limits the learning of generalizable representations. This work proposes to leverage multiple pretext tasks to enhance video representation learning and generalizability. We jointly optimize the C3D network using multiple pretext tasks: rotation prediction, speed prediction, time-direction prediction, and instance discrimination. A nearest-neighbour retrieval task is used to analyze the learned features, and for the action recognition task the network is further fine-tuned from the pretrained weights. We use the UCF-101 dataset for the experiments and achieve 28.45% retrieval accuracy (Recall@1) and 68.85% fine-tuned action recognition accuracy, which is better than the state of the art.
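To make the multi-task pretext setup concrete, below is a minimal PyTorch sketch of the idea described in the abstract: a shared 3D-convolutional backbone (a lightweight stand-in for C3D) feeding four classification heads for rotation, playback speed, time direction, and instance discrimination, trained with a summed cross-entropy loss. All module names, head sizes, and loss weights are illustrative assumptions and not the authors' exact implementation.

```python
# Illustrative multi-task pretext training sketch (assumed details, not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class Small3DBackbone(nn.Module):
    """Tiny C3D-style feature extractor (stand-in for the full C3D network)."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Conv3d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
        )
        self.fc = nn.Linear(64, feat_dim)

    def forward(self, x):                     # x: (B, 3, T, H, W)
        return self.fc(self.features(x).flatten(1))   # (B, feat_dim)


class MultiTaskPretext(nn.Module):
    """Shared backbone with one head per pretext task."""
    def __init__(self, feat_dim=128, num_instances=1000):
        super().__init__()
        self.backbone = Small3DBackbone(feat_dim)
        self.rotation_head = nn.Linear(feat_dim, 4)    # 0/90/180/270 degrees
        self.speed_head = nn.Linear(feat_dim, 3)       # e.g. 1x/2x/4x (assumed classes)
        self.direction_head = nn.Linear(feat_dim, 2)   # forward / backward playback
        self.instance_head = nn.Linear(feat_dim, num_instances)  # instance discrimination

    def forward(self, clip):
        f = self.backbone(clip)
        return {
            "feat": f,                                 # used later for nearest-neighbour retrieval
            "rot": self.rotation_head(f),
            "speed": self.speed_head(f),
            "dir": self.direction_head(f),
            "inst": self.instance_head(f),
        }


def multitask_loss(out, rot_y, speed_y, dir_y, inst_y, weights=(1.0, 1.0, 1.0, 1.0)):
    """Weighted sum of per-task cross-entropy losses (equal weights assumed)."""
    return (weights[0] * F.cross_entropy(out["rot"], rot_y)
            + weights[1] * F.cross_entropy(out["speed"], speed_y)
            + weights[2] * F.cross_entropy(out["dir"], dir_y)
            + weights[3] * F.cross_entropy(out["inst"], inst_y))


if __name__ == "__main__":
    model = MultiTaskPretext()
    clip = torch.randn(2, 3, 16, 112, 112)             # two 16-frame RGB clips
    out = model(clip)
    loss = multitask_loss(out,
                          rot_y=torch.tensor([0, 2]),
                          speed_y=torch.tensor([1, 0]),
                          dir_y=torch.tensor([0, 1]),
                          inst_y=torch.tensor([10, 42]))
    loss.backward()
    print(loss.item())
```

After such pretext training, the backbone features ("feat" above) would be the ones used for nearest-neighbour video retrieval (e.g. Recall@1), while fine-tuning the backbone with a classification head corresponds to the action recognition evaluation on UCF-101.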