Contrastive Self-Supervised Pre-Training for Video Quality Assessment

Pengfei Chen;Leida Li;Jinjian Wu;Weisheng Dong;Guangming Shi
{"title":"Contrastive Self-Supervised Pre-Training for Video Quality Assessment","authors":"Pengfei Chen;Leida Li;Jinjian Wu;Weisheng Dong;Guangming Shi","doi":"10.1109/TIP.2021.3130536","DOIUrl":null,"url":null,"abstract":"Video quality assessment (VQA) task is an ongoing small sample learning problem due to the costly effort required for manual annotation. Since existing VQA datasets are of limited scale, prior research tries to leverage models pre-trained on ImageNet to mitigate this kind of shortage. Nonetheless, these well-trained models targeting on image classification task can be sub-optimal when applied on VQA data from a significantly different domain. In this paper, we make the first attempt to perform self-supervised pre-training for VQA task built upon contrastive learning method, targeting at exploiting the plentiful unlabeled video data to learn feature representation in a simple-yet-effective way. Specifically, we implement this idea by first generating distorted video samples with diverse distortion characteristics and visual contents based on the proposed distortion augmentation strategy. Afterwards, we conduct contrastive learning to capture quality-aware information by maximizing the agreement on feature representations of future frames and their corresponding predictions in the embedding space. In addition, we further introduce distortion prediction task as an additional learning objective to push the model towards discriminating different distortion categories of the input video. Solving these prediction tasks jointly with the contrastive learning not only provides stronger surrogate supervision signals, but also learns the shared knowledge among the prediction tasks. Extensive experiments demonstrate that our approach sets a new state-of-the-art in self-supervised learning for VQA task. Our results also underscore that the learned pre-trained model can significantly benefit the existing learning based VQA models. Source code is available at \n<italic><uri>https://github.com/cpf0079/CSPT</uri></i>\n.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2021-12-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"16","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/9640574/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 16

Abstract

The video quality assessment (VQA) task remains a small-sample learning problem due to the costly effort required for manual annotation. Since existing VQA datasets are of limited scale, prior research has tried to leverage models pre-trained on ImageNet to mitigate this shortage. Nonetheless, these models, trained for the image classification task, can be sub-optimal when applied to VQA data from a significantly different domain. In this paper, we make the first attempt to perform self-supervised pre-training for the VQA task built upon contrastive learning, aiming to exploit plentiful unlabeled video data to learn feature representations in a simple-yet-effective way. Specifically, we implement this idea by first generating distorted video samples with diverse distortion characteristics and visual contents based on the proposed distortion augmentation strategy. Afterwards, we conduct contrastive learning to capture quality-aware information by maximizing the agreement between the feature representations of future frames and their corresponding predictions in the embedding space. In addition, we introduce a distortion prediction task as an additional learning objective to push the model towards discriminating different distortion categories of the input video. Solving these prediction tasks jointly with the contrastive learning not only provides stronger surrogate supervision signals, but also learns the knowledge shared among the prediction tasks. Extensive experiments demonstrate that our approach sets a new state-of-the-art in self-supervised learning for the VQA task. Our results also underscore that the learned pre-trained model can significantly benefit existing learning-based VQA models. Source code is available at https://github.com/cpf0079/CSPT.
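
The joint objective described above can be illustrated with a minimal PyTorch sketch. This is not the authors' implementation (see the linked repository for that); the module names (`CSPTSketch`, `aggregator`, `distortion_head`), the tiny convolutional encoder, and the InfoNCE formulation are illustrative assumptions about how predicting future-frame embeddings and classifying distortion types might be optimized together.

```python
# A minimal sketch (not the authors' code) of the joint pre-training objective:
# an InfoNCE-style contrastive loss that aligns predicted future-frame embeddings
# with the encoder's actual future-frame embeddings, plus an auxiliary
# distortion-classification loss. Module names and sizes are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CSPTSketch(nn.Module):
    def __init__(self, feat_dim=128, num_distortions=8):
        super().__init__()
        # Per-frame encoder (stand-in for a deep CNN backbone).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        # Aggregates past-frame embeddings and predicts the future-frame embedding.
        self.aggregator = nn.GRU(feat_dim, feat_dim, batch_first=True)
        self.predictor = nn.Linear(feat_dim, feat_dim)
        # Auxiliary head that classifies the distortion type of the clip.
        self.distortion_head = nn.Linear(feat_dim, num_distortions)

    def forward(self, clip):
        # clip: (B, T, 3, H, W); the last frame plays the role of the "future" frame.
        b, t = clip.shape[:2]
        feats = self.encoder(clip.flatten(0, 1)).view(b, t, -1)   # (B, T, D)
        past, future = feats[:, :-1], feats[:, -1]                # context / target
        context = self.aggregator(past)[0][:, -1]                 # (B, D)
        pred = self.predictor(context)                            # predicted future embedding
        logits_dist = self.distortion_head(context)               # distortion logits
        return pred, future, logits_dist


def info_nce(pred, target, temperature=0.07):
    # Maximize agreement between each prediction and its own future-frame
    # embedding; the other samples in the batch act as negatives.
    pred = F.normalize(pred, dim=-1)
    target = F.normalize(target, dim=-1)
    logits = pred @ target.t() / temperature          # (B, B) similarity matrix
    labels = torch.arange(pred.size(0), device=pred.device)
    return F.cross_entropy(logits, labels)


if __name__ == "__main__":
    model = CSPTSketch()
    clip = torch.randn(4, 8, 3, 64, 64)              # 4 clips, 8 frames each
    distortion_labels = torch.randint(0, 8, (4,))    # labels from the augmentation step
    pred, future, logits_dist = model(clip)
    loss = info_nce(pred, future) + F.cross_entropy(logits_dist, distortion_labels)
    loss.backward()
    print(float(loss))
```

In this sketch the contrastive term and the distortion-classification term are simply summed; a weighting coefficient between the two objectives would be a natural tuning knob in practice.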