{"title":"Multi-Level Feature-Guided Stereoscopic Video Quality Assessment Based on Transformer and Convolutional Neural Network","authors":"Yuan Chen, Sumei Li","doi":"10.1109/ICME55011.2023.00428","DOIUrl":null,"url":null,"abstract":"Stereoscopic video (3D video) has been increasingly applied in industry and entertainment. And the research of stereoscopic video quality assessment (SVQA) has become very important for promoting the development of stereoscopic video system. Many CNN-based models have emerged for SVQA task. However, these methods ignore the significance of the global information of the video frames for quality perception. In this paper, we propose a multi-level feature-fusion model based on Transformer and convolutional neural network (MFFTCNet) to assess the perceptual quality of the stereoscopic video. Firstly, we use global information from Transformer to guide local information from convolutional neural network (CNN). Moreover, we utilize low-level features in the CNN branch to guide high-level features. Besides, considering the binocular rivalry effect in the human vision system (HVS), we use 3D convolution to achieve rivalry fusion of binocular features. The proposed method is tested on two public stereoscopic video quality datasets. The result shows that this method correlates highly with human visual perception and outperforms state-of-the-art (SOTA) methods by a significant margin.","PeriodicalId":321830,"journal":{"name":"2023 IEEE International Conference on Multimedia and Expo (ICME)","volume":"66 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2023 IEEE International Conference on Multimedia and Expo (ICME)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICME55011.2023.00428","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Stereoscopic video (3D video) is increasingly applied in industry and entertainment, and research on stereoscopic video quality assessment (SVQA) has become important for advancing stereoscopic video systems. Many CNN-based models have emerged for the SVQA task; however, these methods ignore the significance of the global information in video frames for quality perception. In this paper, we propose a multi-level feature-fusion model based on a Transformer and a convolutional neural network (MFFTCNet) to assess the perceptual quality of stereoscopic video. First, we use global information from the Transformer to guide local information from the convolutional neural network (CNN). Moreover, we utilize low-level features in the CNN branch to guide high-level features. In addition, considering the binocular rivalry effect in the human visual system (HVS), we use 3D convolution to achieve rivalry fusion of binocular features. The proposed method is tested on two public stereoscopic video quality datasets. The results show that it correlates highly with human visual perception and outperforms state-of-the-art (SOTA) methods by a significant margin.
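
To make the two mechanisms named in the abstract concrete, the sketch below shows, in PyTorch, one plausible way to realize (a) Transformer-guided local CNN features and (b) binocular rivalry fusion via 3D convolution. This is a minimal illustration, not the authors' implementation: the module names (GlobalGuidedFusion, BinocularRivalryFusion), the channel-gating scheme, and all layer sizes are assumptions, since the abstract does not specify the exact architecture.

```python
# Hypothetical sketch of two ideas from the abstract (not the paper's code):
# 1) a pooled Transformer descriptor gates local CNN feature channels,
# 2) left/right-view features are fused with a 3D convolution over the view axis.
import torch
import torch.nn as nn


class GlobalGuidedFusion(nn.Module):
    """Re-weight local CNN features using a global (Transformer-style) descriptor."""

    def __init__(self, channels: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(channels, channels), nn.Sigmoid())

    def forward(self, local_feat: torch.Tensor, global_feat: torch.Tensor) -> torch.Tensor:
        # local_feat: (B, C, H, W) from a CNN stage
        # global_feat: (B, C) pooled Transformer token, e.g. a CLS embedding
        weights = self.gate(global_feat).unsqueeze(-1).unsqueeze(-1)  # (B, C, 1, 1)
        return local_feat * weights  # channel-wise global guidance


class BinocularRivalryFusion(nn.Module):
    """Fuse left/right-view features with a 3D convolution across the view axis."""

    def __init__(self, channels: int):
        super().__init__()
        # A kernel of size 2 along the view dimension collapses the two views
        # into one fused map, letting the network learn an asymmetric weighting
        # reminiscent of binocular rivalry.
        self.fuse = nn.Conv3d(channels, channels,
                              kernel_size=(2, 3, 3), padding=(0, 1, 1))

    def forward(self, left_feat: torch.Tensor, right_feat: torch.Tensor) -> torch.Tensor:
        # Stack views along a new depth axis: (B, C, 2, H, W)
        stacked = torch.stack([left_feat, right_feat], dim=2)
        return self.fuse(stacked).squeeze(2)  # back to (B, C, H, W)


if __name__ == "__main__":
    left = torch.randn(1, 64, 32, 32)    # left-view CNN features (toy sizes)
    right = torch.randn(1, 64, 32, 32)   # right-view CNN features
    g = torch.randn(1, 64)               # pooled global Transformer descriptor
    guide = GlobalGuidedFusion(64)
    fused = BinocularRivalryFusion(64)(guide(left, g), guide(right, g))
    print(fused.shape)  # torch.Size([1, 64, 32, 32])
```

The low-level-to-high-level guidance mentioned in the abstract could follow the same gating pattern, with a pooled low-level feature taking the place of the Transformer descriptor; the abstract leaves that detail open as well.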