{"title":"Viewport Prediction With Unsupervised Multiscale Causal Representation Learning for Virtual Reality Video Streaming","authors":"Yingjie Liu;Dan Wang;Bin Song","doi":"10.1109/TMM.2025.3543087","DOIUrl":null,"url":null,"abstract":"The rise of the metaverse has driven the rapid development of various applications, such as Virtual Reality (VR) and Augmented Reality (AR). As a form of multimedia in the metaverse, VR video streaming (a.k.a., VR spherical video streaming and 360<inline-formula><tex-math>$^{\\circ }$</tex-math></inline-formula> video streaming) can provide users with a 360<inline-formula><tex-math>$^{\\circ }$</tex-math></inline-formula> immersive experience. Generally, transmitting VR video requires far more bandwidth than regular videos, which greatly strains existing network transmission. Predicting and selectively streaming VR video in the users' viewports in advance can reduce bandwidth consumption and system latency. However, existing methods either consider only historical viewport-based prediction methods or predict viewports by correlations between visual features of video frames, making it hard to adapt to the dynamics of users and video content. In the meantime, spurious correlations between visual features lead to inaccurate and unreliable prediction results. Hence, we propose an unsupervised multiscale causal representation learning (UMCRL)-based method to predict viewports in VR video streaming, including user preference-based and video content-based viewport prediction models. The former is designed by a position predictor to predict the future users' viewports based on their historical viewports in multiple video frames to adapt to users' dynamic preferences. The latter achieves unsupervised multiscale causal representation learning through an asymmetric causal regressor, used to infer the causalities between local and global-local visual features in video frames, thereby helping the model understand the contextual information in the videos. We embed the causalities in the transformer decoder via causal self-attention for predicting the users' viewports, adapting to the dynamic changes of video content. Finally, combining the results of the two aforementioned models yields the final prediction of the users' viewports. In addition, the QoE of users is satisfied by assigning different bitrates to the tiles in the viewport through a pyramid-based bitrate allocation. The experimental results verify the effectiveness of the method.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"4752-4764"},"PeriodicalIF":9.7000,"publicationDate":"2025-02-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Multimedia","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10891642/","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Citations: 0
Abstract
The rise of the metaverse has driven the rapid development of applications such as Virtual Reality (VR) and Augmented Reality (AR). As a form of multimedia in the metaverse, VR video streaming (also known as VR spherical video streaming or 360° video streaming) provides users with a 360° immersive experience. Transmitting VR video generally requires far more bandwidth than regular video, which places great strain on existing transmission networks. Predicting users' future viewports and selectively streaming only the corresponding portions of the video in advance can reduce both bandwidth consumption and system latency. However, existing methods either rely solely on users' historical viewports or predict viewports from correlations between the visual features of video frames, making it hard to adapt to the dynamics of users and video content. Moreover, spurious correlations between visual features lead to inaccurate and unreliable predictions. Hence, we propose an unsupervised multiscale causal representation learning (UMCRL)-based method for viewport prediction in VR video streaming, comprising a user preference-based and a video content-based viewport prediction model. The former uses a position predictor to forecast users' future viewports from their historical viewports over multiple video frames, adapting to users' dynamic preferences. The latter achieves unsupervised multiscale causal representation learning through an asymmetric causal regressor that infers the causalities between local and global-local visual features in video frames, helping the model understand contextual information in the videos. We embed these causalities into a transformer decoder via causal self-attention to predict users' viewports while adapting to dynamic changes in video content. Finally, combining the outputs of the two models yields the final viewport prediction. In addition, users' quality of experience (QoE) is maintained by assigning different bitrates to the tiles in the viewport through pyramid-based bitrate allocation. Experimental results verify the effectiveness of the method.
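The abstract alone does not specify the authors' implementation, but the two mechanisms it names, causal self-attention over viewport sequences and pyramid-based bitrate allocation over tiles, can be illustrated with a minimal PyTorch sketch. Below, `CausalViewportPredictor` applies a causal (upper-triangular) attention mask over a history of (yaw, pitch) viewport positions, and `pyramid_bitrates` assigns decreasing bitrates to tiles by their ring distance from the predicted viewport's center tile. All names, the (yaw, pitch) parameterization, the 6×12 tile grid, and the bitrate levels are illustrative assumptions, not the paper's actual design; the sketch also uses an encoder layer with a causal mask in place of a full decoder stack.

```python
# Minimal sketch of causal self-attention viewport prediction and
# pyramid-style tile bitrate allocation. Not the paper's UMCRL model;
# all names and constants are illustrative assumptions.
import torch
import torch.nn as nn

class CausalViewportPredictor(nn.Module):
    """Predicts the next viewport (yaw, pitch) from a viewport history
    using self-attention with a causal mask."""
    def __init__(self, d_model: int = 64, n_heads: int = 4, hist_len: int = 16):
        super().__init__()
        self.embed = nn.Linear(2, d_model)           # (yaw, pitch) -> d_model
        self.pos = nn.Parameter(torch.zeros(hist_len, d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, 2)            # back to (yaw, pitch)

    def forward(self, hist: torch.Tensor) -> torch.Tensor:
        # hist: (batch, T, 2). The boolean upper-triangular mask blocks
        # attention to future steps, so each step sees only earlier viewports.
        T = hist.size(1)
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        h = self.embed(hist) + self.pos[:T]
        h = self.encoder(h, mask=mask)
        return self.head(h[:, -1])                   # prediction for next frame

def pyramid_bitrates(center: tuple, grid: tuple = (6, 12),
                     levels: tuple = (8.0, 4.0, 1.0)) -> torch.Tensor:
    """Assigns a bitrate (Mbps, illustrative values) to each tile by its
    Chebyshev ring distance from the predicted viewport's center tile."""
    rows, cols = grid
    r0, c0 = center
    rates = torch.empty(rows, cols)
    for r in range(rows):
        for c in range(cols):
            dc = min(abs(c - c0), cols - abs(c - c0))  # wrap horizontally
            ring = min(max(abs(r - r0), dc), len(levels) - 1)
            rates[r, c] = levels[ring]
    return rates
```

For example, `pyramid_bitrates(center=(2, 5))` returns a 6×12 grid in which the center tile and its innermost ring receive the highest rate while peripheral tiles fall to the lowest level, mirroring the pyramid allocation the abstract describes; `CausalViewportPredictor()(torch.randn(1, 16, 2))` yields a (1, 2) prediction for the next viewport.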
Journal Overview
The IEEE Transactions on Multimedia covers diverse aspects of multimedia technology and applications, including circuits, networking, signal processing, systems, software, and systems integration. Its scope aligns with the Fields of Interest of its sponsors, ensuring comprehensive coverage of multimedia research.