CLIPVQA: Video Quality Assessment via CLIP

IF 4.8 | Zone 1, Computer Science | JCR Q2 ENGINEERING, ELECTRICAL & ELECTRONIC

IEEE Transactions on Broadcasting | Pub Date: 2024-12-27 | DOI: 10.1109/TBC.2024.3511927

Fengchuang Xing;Mingjie Li;Yuan-Gen Wang;Guopu Zhu;Xiaochun Cao

Vol. 71, No. 1, pp. 291-306 | https://ieeexplore.ieee.org/document/10817097/

Citations: 0

Abstract

In learning vision-language representations from Web-scale data, the contrastive language-image pre-training (CLIP) mechanism has demonstrated remarkable performance in many vision tasks. However, its application to the widely studied video quality assessment (VQA) task is still an open issue. In this paper, we propose an efficient and effective CLIP-based Transformer method for the VQA problem (CLIPVQA). Specifically, we first design a video frame perception paradigm to extract the rich spatiotemporal quality and content information among video frames. Then, the spatiotemporal quality features are integrated using a self-attention mechanism to yield a video-level quality representation. To utilize the quality language descriptions of videos for supervision, we develop a CLIP-based encoder for language embedding, which is then aggregated with the generated content information via a cross-attention module to produce a video-language representation. Finally, the video-level quality and video-language representations are fused for final video quality prediction, where a vectorized regression loss is employed for efficient end-to-end optimization. Comprehensive experiments are conducted on eight in-the-wild video datasets with diverse resolutions to evaluate the performance of CLIPVQA. The experimental results show that the proposed CLIPVQA achieves new state-of-the-art VQA performance and up to 37% better generalizability than existing benchmark VQA methods. A series of ablation studies is also performed to validate the effectiveness of each module in CLIPVQA.
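The data flow described in the abstract (self-attention over frame-level quality features, cross-attention between language embeddings and content features, then fusion into a score) can be sketched roughly as follows. This is a minimal illustration in NumPy, not the authors' implementation: the dimensions, the random stand-in features, the choice of language embeddings as attention queries, the mean pooling, and the linear prediction head are all assumptions made for the example. The vectorized regression loss is left out because the abstract does not define it.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def attention(queries, keys, values):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = queries.shape[-1]
    weights = softmax(queries @ keys.T / np.sqrt(d), axis=-1)
    return weights @ values


rng = np.random.default_rng(0)

# Hypothetical sizes: T frames, K quality descriptions, D feature width.
T, K, D = 8, 5, 16

# Random stand-ins for the outputs of the (unspecified) frame and text encoders.
frame_quality = rng.standard_normal((T, D))  # per-frame quality features
frame_content = rng.standard_normal((T, D))  # per-frame content features
text_embed = rng.standard_normal((K, D))     # embeddings of quality descriptions

# Self-attention over frames, mean-pooled into a video-level quality vector
# (the pooling choice is an assumption).
video_quality = attention(frame_quality, frame_quality, frame_quality).mean(axis=0)

# Cross-attention: language embeddings attend to content features
# (the query/key direction is an assumption).
video_language = attention(text_embed, frame_content, frame_content).mean(axis=0)

# Fuse both representations and map them to a scalar with a placeholder
# linear head; a trained model would learn these weights.
fused = np.concatenate([video_quality, video_language])
head = rng.standard_normal(fused.shape[0]) / np.sqrt(fused.shape[0])
score = float(fused @ head)
```

With random features and an untrained head the resulting `score` carries no meaning; the sketch only shows how the two attention stages and the fusion step connect.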


Source journal

IEEE Transactions on Broadcasting (Engineering & Technology: Telecommunications)

CiteScore

9.40

Self-citation rate

31.10%

Number of articles published

Review time

6-12 weeks

Journal description: The Society’s Field of Interest is “Devices, equipment, techniques and systems related to broadcast technology, including the production, distribution, transmission, and propagation aspects.” In addition to this formal FOI statement, which is used to provide guidance to the Publications Committee in the selection of content, the AdCom has further resolved that “broadcast systems includes all aspects of transmission, propagation, and reception.”