Deep Learning Approach for No-Reference Screen Content Video Quality Assessment

IF 3.2 · Zone 1 (Computer Science) · Q2 ENGINEERING, ELECTRICAL & ELECTRONIC

IEEE Transactions on Broadcasting Pub Date : 2024-03-26 DOI:10.1109/TBC.2024.3374042

Ngai-Wing Kwong;Yui-Lam Chan;Sik-Ho Tsang;Ziyin Huang;Kin-Man Lam

Volume 70, Issue 2, pp. 555-569 · Full text: https://ieeexplore.ieee.org/document/10479473/

Citations: 0

Abstract



Screen content video (SCV) drew more attention than ever during the COVID-19 period and has moved from a niche into the mainstream with the recent proliferation of remote offices, online meetings, shared-screen collaboration, and live game streaming. Quality assessment for screen content media is therefore in high demand for maintaining service quality. Although many practical natural scene video quality assessment methods have been proposed and have achieved promising results, they cannot be applied directly to the screen content video quality assessment (SCVQA) task, because the content characteristics of SCV differ substantially from those of natural scene video. Moreover, only one no-reference SCVQA (NR-SCVQA) method has been proposed in the literature, and it requires handcrafted features. We therefore propose the first deep learning approach explicitly designed for NR-SCVQA. First, a multi-channel convolutional neural network (CNN) model extracts spatial quality features of pictorial and textual regions separately. Because no human-annotated quality score exists for each screen content frame (SCF), the CNN model is pre-trained in a multi-task self-supervised fashion to extract a spatial quality feature representation of each SCF. Second, we propose a time-distributed CNN transformer model (TCNNT) that further processes the spatial quality feature representations of all SCFs in an SCV and learns spatial and temporal features simultaneously, so that high-level spatiotemporal features of the SCV can be extracted and used to assess the quality of the whole SCV. Experimental results demonstrate the robustness and validity of our model, whose predictions are closely related to human perception.
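The abstract does not give architectural details, so the following is only a rough sketch of the general pattern it describes: a shared per-frame encoder applied to every frame ("time-distributed"), followed by self-attention across frames and a regression head that outputs one score per video. Everything here is an assumption made for illustration: the linear `frame_encoder` stands in for the paper's multi-channel CNN, the single attention head stands in for its transformer, and all shapes, names, and random weights are hypothetical. It is not the authors' TCNNT.

```python
import numpy as np

rng = np.random.default_rng(0)

def time_distributed(fn, frames):
    """Apply the same per-frame function to every frame and stack the results."""
    return np.stack([fn(frame) for frame in frames])

def frame_encoder(frame, w_enc):
    """Hypothetical stand-in for a spatial feature extractor: flatten, linear, ReLU."""
    return np.maximum(frame.reshape(-1) @ w_enc, 0.0)

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over the time axis."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

def predict_quality(video, params):
    """Map a (T, H, W) video to one scalar score plus the (T, T) attention map."""
    feats = time_distributed(lambda f: frame_encoder(f, params["enc"]), video)
    context, attn = self_attention(feats, params["w_q"], params["w_k"], params["w_v"])
    pooled = context.mean(axis=0)  # average over frames
    return float(pooled @ params["head"]), attn

# Toy dimensions (assumed): 8 frames of 16x16 pixels, 32-dimensional features.
T, H, W, D = 8, 16, 16, 32
params = {
    "enc": rng.normal(scale=0.05, size=(H * W, D)),
    "w_q": rng.normal(scale=0.1, size=(D, D)),
    "w_k": rng.normal(scale=0.1, size=(D, D)),
    "w_v": rng.normal(scale=0.1, size=(D, D)),
    "head": rng.normal(scale=0.1, size=(D,)),
}
video = rng.random((T, H, W))
score, attn = predict_quality(video, params)
```

With untrained random weights the score is meaningless; the sketch only shows how per-frame features can be shared across time and then mixed by attention before pooling into a single video-level prediction.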


Source journal

IEEE Transactions on Broadcasting (Engineering & Technology: Telecommunications)

CiteScore

9.40

Self-citation rate

31.10%

Articles published

Review time

6-12 weeks

About the journal: The Society's Field of Interest is "Devices, equipment, techniques and systems related to broadcast technology, including the production, distribution, transmission, and propagation aspects." In addition to this formal FOI statement, which is used to provide guidance to the Publications Committee in the selection of content, the AdCom has further resolved that "broadcast systems includes all aspects of transmission, propagation, and reception."