{"title":"Viewport Prediction With Unsupervised Multiscale Causal Representation Learning for Virtual Reality Video Streaming","authors":"Yingjie Liu;Dan Wang;Bin Song","doi":"10.1109/TMM.2025.3543087","DOIUrl":null,"url":null,"abstract":"The rise of the metaverse has driven the rapid development of various applications, such as Virtual Reality (VR) and Augmented Reality (AR). As a form of multimedia in the metaverse, VR video streaming (a.k.a., VR spherical video streaming and 360<inline-formula><tex-math>$^{\\circ }$</tex-math></inline-formula> video streaming) can provide users with a 360<inline-formula><tex-math>$^{\\circ }$</tex-math></inline-formula> immersive experience. Generally, transmitting VR video requires far more bandwidth than regular videos, which greatly strains existing network transmission. Predicting and selectively streaming VR video in the users' viewports in advance can reduce bandwidth consumption and system latency. However, existing methods either consider only historical viewport-based prediction methods or predict viewports by correlations between visual features of video frames, making it hard to adapt to the dynamics of users and video content. In the meantime, spurious correlations between visual features lead to inaccurate and unreliable prediction results. Hence, we propose an unsupervised multiscale causal representation learning (UMCRL)-based method to predict viewports in VR video streaming, including user preference-based and video content-based viewport prediction models. The former is designed by a position predictor to predict the future users' viewports based on their historical viewports in multiple video frames to adapt to users' dynamic preferences. The latter achieves unsupervised multiscale causal representation learning through an asymmetric causal regressor, used to infer the causalities between local and global-local visual features in video frames, thereby helping the model understand the contextual information in the videos. We embed the causalities in the transformer decoder via causal self-attention for predicting the users' viewports, adapting to the dynamic changes of video content. Finally, combining the results of the two aforementioned models yields the final prediction of the users' viewports. In addition, the QoE of users is satisfied by assigning different bitrates to the tiles in the viewport through a pyramid-based bitrate allocation. The experimental results verify the effectiveness of the method.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"4752-4764"},"PeriodicalIF":9.7000,"publicationDate":"2025-02-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Multimedia","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10891642/","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Citations: 0
Abstract
The rise of the metaverse has driven the rapid development of applications such as Virtual Reality (VR) and Augmented Reality (AR). As a form of multimedia in the metaverse, VR video streaming (also known as VR spherical video streaming or 360° video streaming) provides users with a 360° immersive experience. Transmitting VR video generally requires far more bandwidth than regular video, which places great strain on existing transmission networks. Predicting users' future viewports and selectively streaming only the corresponding portions of the video in advance can reduce both bandwidth consumption and system latency. However, existing methods either rely solely on users' historical viewports or predict viewports from correlations between the visual features of video frames, making it hard to adapt to the dynamics of users and video content. Moreover, spurious correlations between visual features lead to inaccurate and unreliable predictions. Hence, we propose an unsupervised multiscale causal representation learning (UMCRL)-based method for viewport prediction in VR video streaming, comprising a user preference-based and a video content-based viewport prediction model. The former uses a position predictor to forecast users' future viewports from their historical viewports over multiple video frames, adapting to users' dynamic preferences. The latter achieves unsupervised multiscale causal representation learning through an asymmetric causal regressor that infers the causalities between local and global-local visual features in video frames, helping the model understand contextual information in the videos. We embed these causalities into a transformer decoder via causal self-attention to predict users' viewports while adapting to dynamic changes in video content. Finally, combining the outputs of the two models yields the final viewport prediction. In addition, users' quality of experience (QoE) is maintained by assigning different bitrates to the tiles in the viewport through pyramid-based bitrate allocation. Experimental results verify the effectiveness of the method.
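The abstract alone does not specify the authors' implementation, but the two mechanisms it names, causal self-attention over viewport sequences and pyramid-based bitrate allocation over tiles, can be illustrated with a minimal PyTorch sketch. Below, `CausalViewportPredictor` applies a causal (upper-triangular) attention mask over a history of (yaw, pitch) viewport positions, and `pyramid_bitrates` assigns decreasing bitrates to tiles by their ring distance from the predicted viewport's center tile. All names, the (yaw, pitch) parameterization, the 6×12 tile grid, and the bitrate levels are illustrative assumptions, not the paper's actual design; the sketch also uses an encoder layer with a causal mask in place of a full decoder stack.

```python
# Minimal sketch of causal self-attention viewport prediction and
# pyramid-style tile bitrate allocation. Not the paper's UMCRL model;
# all names and constants are illustrative assumptions.
import torch
import torch.nn as nn

class CausalViewportPredictor(nn.Module):
    """Predicts the next viewport (yaw, pitch) from a viewport history
    using self-attention with a causal mask."""
    def __init__(self, d_model: int = 64, n_heads: int = 4, hist_len: int = 16):
        super().__init__()
        self.embed = nn.Linear(2, d_model)           # (yaw, pitch) -> d_model
        self.pos = nn.Parameter(torch.zeros(hist_len, d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, 2)            # back to (yaw, pitch)

    def forward(self, hist: torch.Tensor) -> torch.Tensor:
        # hist: (batch, T, 2). The boolean upper-triangular mask blocks
        # attention to future steps, so each step sees only earlier viewports.
        T = hist.size(1)
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        h = self.embed(hist) + self.pos[:T]
        h = self.encoder(h, mask=mask)
        return self.head(h[:, -1])                   # prediction for next frame

def pyramid_bitrates(center: tuple, grid: tuple = (6, 12),
                     levels: tuple = (8.0, 4.0, 1.0)) -> torch.Tensor:
    """Assigns a bitrate (Mbps, illustrative values) to each tile by its
    Chebyshev ring distance from the predicted viewport's center tile."""
    rows, cols = grid
    r0, c0 = center
    rates = torch.empty(rows, cols)
    for r in range(rows):
        for c in range(cols):
            dc = min(abs(c - c0), cols - abs(c - c0))  # wrap horizontally
            ring = min(max(abs(r - r0), dc), len(levels) - 1)
            rates[r, c] = levels[ring]
    return rates
```

For example, `pyramid_bitrates(center=(2, 5))` returns a 6×12 grid in which the center tile and its innermost ring receive the highest rate while peripheral tiles fall to the lowest level, mirroring the pyramid allocation the abstract describes; `CausalViewportPredictor()(torch.randn(1, 16, 2))` yields a (1, 2) prediction for the next viewport.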
Journal Overview
The IEEE Transactions on Multimedia covers diverse aspects of multimedia technology and applications, including circuits, networking, signal processing, systems, software, and systems integration. Its scope aligns with the Fields of Interest of its sponsors, ensuring comprehensive coverage of multimedia research.