CLiF-VQA: Enhancing Video Quality Assessment by Incorporating High-Level Semantic Information related to Human Feelings

arXiv (Cornell University), Pub Date: 2023-11-13, DOI: 10.48550/arxiv.2311.07090

Yachun Mi, Yu Li, Yan Shu, Chen Hui, Puchao Zhou, Shaohui Liu



Abstract

Video Quality Assessment (VQA) aims to simulate how the human visual system (HVS) perceives video quality. Judgments made by the HVS are always influenced by subjective human feelings. However, most current VQA research focuses on capturing distortions in the spatial and temporal domains of videos and ignores the impact of human feelings. In this paper, we propose CLiF-VQA, which considers both features related to human feelings and spatial features of videos. To extract features related to human feelings effectively, we explore, for the first time, the consistency between CLIP and human feelings in video perception. Specifically, we design multiple objective and subjective descriptions closely related to human feelings as prompts. We further propose a novel CLIP-based semantic feature extractor (SFE), which extracts features related to human feelings by sliding over multiple regions of the video frame. In addition, we capture low-level-aware features of the video through a spatial feature extraction module. The two kinds of features are then aggregated to obtain the quality score of the video. Extensive experiments show that the proposed CLiF-VQA exhibits excellent performance on several VQA datasets.
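The sliding-region idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give window sizes, prompt texts, or the aggregation rule, so everything below is an assumption. In particular, `toy_encoder` is a hypothetical stand-in for a CLIP image encoder, the two prompt vectors stand in for CLIP text embeddings of feeling-related prompts, and concatenation is only one plausible way to aggregate the two kinds of features.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Frame = List[List[float]]  # grayscale frame, pixel values in [0, 1]


def sliding_windows(height: int, width: int, size: int, stride: int) -> List[Tuple[int, int]]:
    """Return the (top, left) corner of every square window that fits in the frame."""
    return [(top, left)
            for top in range(0, height - size + 1, stride)
            for left in range(0, width - size + 1, stride)]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; defined as 0.0 when either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def softmax(scores: Sequence[float]) -> Vector:
    """Numerically stable softmax."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def semantic_features(frame: Frame,
                      encode_region: Callable[[Frame], Vector],
                      prompt_embeddings: Sequence[Vector],
                      size: int,
                      stride: int) -> Vector:
    """Average, over all windows, each window's softmax similarity to every prompt."""
    corners = sliding_windows(len(frame), len(frame[0]), size, stride)
    if not corners:
        raise ValueError("window does not fit inside the frame")
    sums = [0.0] * len(prompt_embeddings)
    for top, left in corners:
        region = [row[left:left + size] for row in frame[top:top + size]]
        embedding = encode_region(region)
        probs = softmax([cosine(embedding, p) for p in prompt_embeddings])
        sums = [s + p for s, p in zip(sums, probs)]
    return [s / len(corners) for s in sums]


def toy_encoder(region: Frame) -> Vector:
    """Hypothetical placeholder for a CLIP image encoder: (brightness, darkness)."""
    pixels = [p for row in region for p in row]
    mean = sum(pixels) / len(pixels)
    return [mean, 1.0 - mean]


def aggregate(semantic: Vector, spatial: Vector) -> Vector:
    """Assumed aggregation: plain concatenation of the two feature vectors."""
    return list(semantic) + list(spatial)


# Hypothetical prompt embeddings, standing in for text such as "bright" and "dark".
PROMPTS = [[1.0, 0.0], [0.0, 1.0]]

bright_frame = [[1.0] * 4 for _ in range(4)]
features = semantic_features(bright_frame, toy_encoder, PROMPTS, size=2, stride=2)
```

With a real model, `toy_encoder` and `PROMPTS` would be replaced by the image and text encoders of a CLIP checkpoint, and the concatenated vector would be passed to a regression head that predicts the quality score.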

