Cross-modal Manifold Cutmix for Self-supervised Video Representation Learning

2023 18th International Conference on Machine Vision and Applications (MVA). Pub Date: 2021-12-07. DOI: 10.23919/MVA57639.2023.10216260

Srijan Das, M. Ryoo

Abstract

In this paper, we address the challenge of obtaining large-scale unlabelled video datasets for contrastive representation learning in real-world applications. We present a novel video augmentation technique for self-supervised learning, called Cross-Modal Manifold Cutmix (CMMC), which generates augmented samples by combining different modalities in videos. By embedding a video tesseract into another across two modalities in the feature space, our method enhances the quality of learned video representations. We perform extensive experiments on two small-scale video datasets, UCF101 and HMDB51, for action recognition and video retrieval tasks. Our approach is also shown to be effective on the NTU dataset with limited domain knowledge. Our CMMC achieves comparable performance to other self-supervised methods while using less training data for both downstream tasks.
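The core idea of embedding one video tesseract into another in feature space can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the feature layout `(C, T, H, W)`, the function names `sample_cuboid` and `cross_modal_cutmix`, the `frac` parameter, and the uniform sampling of the cuboid position are all assumptions made for illustration. The two inputs stand for feature maps of the same clip computed from two different modalities.

```python
import numpy as np

def sample_cuboid(shape, frac, rng):
    """Pick a random (t, h, w) cuboid covering roughly `frac` of the volume."""
    dims = tuple(shape)
    scale = frac ** (1.0 / 3.0)  # shrink each of the three axes equally (assumption)
    sizes = [max(1, int(round(n * scale))) for n in dims]
    starts = [int(rng.integers(0, n - s + 1)) for n, s in zip(dims, sizes)]
    return tuple(slice(st, st + sz) for st, sz in zip(starts, sizes))

def cross_modal_cutmix(feat_a, feat_b, frac, rng):
    """Paste a cuboid of feat_b into feat_a; both are (C, T, H, W) arrays.

    Returns the mixed features and lam, the fraction of feat_a that survives.
    """
    assert feat_a.shape == feat_b.shape
    t, h, w = sample_cuboid(feat_a.shape[1:], frac, rng)
    mixed = feat_a.copy()                      # leave the input untouched
    mixed[:, t, h, w] = feat_b[:, t, h, w]     # same cuboid across all channels
    region = (t.stop - t.start) * (h.stop - h.start) * (w.stop - w.start)
    lam = 1.0 - region / float(np.prod(feat_a.shape[1:]))
    return mixed, lam

# Hypothetical usage: random arrays stand in for intermediate features
# of two modalities extracted from the same clip.
rng = np.random.default_rng(0)
feat_a = rng.standard_normal((4, 8, 6, 6))
feat_b = rng.standard_normal((4, 8, 6, 6))
mixed, lam = cross_modal_cutmix(feat_a, feat_b, frac=0.125, rng=rng)
```

In a contrastive setup, `lam` could be used to weight how strongly the mixed sample is treated as a positive for each source modality; how the paper itself uses the mixing ratio is not stated in the abstract.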
