Deep Curriculum Reinforcement Learning for Adaptive 360° Video Streaming With Two-Stage Training

IF 3.2 · CAS Tier 1 (Computer Science) · JCR Q2 (Engineering, Electrical & Electronic)
Yuhong Xie, Yuan Zhang, Tao Lin
{"title":"Deep Curriculum Reinforcement Learning for Adaptive 360° Video Streaming With Two-Stage Training","authors":"Yuhong Xie;Yuan Zhang;Tao Lin","doi":"10.1109/TBC.2023.3334137","DOIUrl":null,"url":null,"abstract":"Deep reinforcement learning (DRL) has demonstrated remarkable potential within the domain of video adaptive bitrate (ABR) optimization. However, training a well-performing DRL agent in the two-tier 360° video streaming system is non-trivial. The conventional DRL training approach fails to enable the model to start learning from simpler environments and then progressively explore more challenging ones, leading to suboptimal asymptotic performance and poor long-tail performance. In this paper, we propose a novel approach called DCRL360, which seamlessly integrates automatic curriculum learning (ACL) with DRL techniques to enable adaptive decision-making for 360° video bitrate selection and chunk scheduling. To tackle the training issue, we introduce a structured two-stage training framework. The first stage focuses on the selection of tasks conducive to learning, guided by a newly introduced training metric called Pscore, to enhance asymptotic performance. The newly introduced metric takes into consideration multiple facets, including performance improvement potential, the risk of being forgotten, and the uncertainty of a decision, to encourage the agent to train in rewarding environments. The second stage utilizes existing rule-based techniques to identify challenging tasks for fine-tuning the model, thereby alleviating the long-tail effect. Our experimental results demonstrate that DCRL360 outperforms state-of-the-art algorithms under various network conditions - including 5G/LTE/Broadband - with a remarkable improvement of 6.51-20.86% in quality of experience (QoE), as well as a reduction in bandwidth wastage by 10.60-31.50%.","PeriodicalId":13159,"journal":{"name":"IEEE Transactions on Broadcasting","volume":"70 2","pages":"441-452"},"PeriodicalIF":3.2000,"publicationDate":"2023-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Broadcasting","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10361536/","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
Citations: 0

Abstract

Deep reinforcement learning (DRL) has demonstrated remarkable potential within the domain of video adaptive bitrate (ABR) optimization. However, training a well-performing DRL agent in the two-tier 360° video streaming system is non-trivial. The conventional DRL training approach fails to enable the model to start learning from simpler environments and then progressively explore more challenging ones, leading to suboptimal asymptotic performance and poor long-tail performance. In this paper, we propose a novel approach called DCRL360, which seamlessly integrates automatic curriculum learning (ACL) with DRL techniques to enable adaptive decision-making for 360° video bitrate selection and chunk scheduling. To tackle the training issue, we introduce a structured two-stage training framework. The first stage focuses on the selection of tasks conducive to learning, guided by a newly introduced training metric called Pscore, to enhance asymptotic performance. The newly introduced metric takes into consideration multiple facets, including performance improvement potential, the risk of being forgotten, and the uncertainty of a decision, to encourage the agent to train in rewarding environments. The second stage utilizes existing rule-based techniques to identify challenging tasks for fine-tuning the model, thereby alleviating the long-tail effect. Our experimental results demonstrate that DCRL360 outperforms state-of-the-art algorithms under various network conditions - including 5G/LTE/Broadband - with a remarkable improvement of 6.51-20.86% in quality of experience (QoE), as well as a reduction in bandwidth wastage by 10.60-31.50%.
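The abstract only names the three facets that feed Pscore (improvement potential, forgetting risk, and decision uncertainty); the paper's exact formula is not reproduced here. The sketch below is a minimal, hypothetical illustration of how such a score could drive stage-one task selection, assuming the components are estimated from recent reward slope, drop from the task's best reward, and policy entropy, and that tasks are sampled with a softmax over the score. The weights and helper names are illustrative assumptions, not DCRL360's actual implementation.

```python
"""Illustrative sketch of Pscore-guided curriculum task selection (stage one).
The component estimates, weights, and softmax sampling below are assumptions
made for clarity; they are not the paper's published Pscore definition."""
import math
import random
from dataclasses import dataclass, field


@dataclass
class TaskStats:
    """Per-environment bookkeeping (e.g., one family of network traces)."""
    rewards: list = field(default_factory=list)    # episode returns seen on this task
    entropies: list = field(default_factory=list)  # mean policy entropy per episode


def pscore(s: TaskStats, w_imp=1.0, w_forget=1.0, w_unc=0.5) -> float:
    """Higher score => more rewarding to train on right now (assumed form)."""
    if len(s.rewards) < 2:
        return float("inf")                                   # always visit unseen tasks first
    improvement = max(s.rewards[-1] - s.rewards[-2], 0.0)     # recent learning progress
    forgetting = max(max(s.rewards) - s.rewards[-1], 0.0)     # drop from this task's best
    uncertainty = s.entropies[-1] if s.entropies else 0.0     # how unsure the policy is
    return w_imp * improvement + w_forget * forgetting + w_unc * uncertainty


def pick_task(stats: dict, temperature=1.0) -> str:
    """Sample the next training environment with a softmax over Pscore."""
    names = list(stats)
    scores = [pscore(stats[n]) for n in names]
    unseen = [n for n, x in zip(names, scores) if math.isinf(x)]
    if unseen:                                                # prioritise never-trained tasks
        return random.choice(unseen)
    mx = max(scores)
    weights = [math.exp((x - mx) / temperature) for x in scores]
    return random.choices(names, weights=weights, k=1)[0]


if __name__ == "__main__":
    stats = {
        "easy_broadband": TaskStats(rewards=[1.0, 2.5, 2.6], entropies=[0.2]),
        "bursty_lte":     TaskStats(rewards=[0.5, 1.8, 1.1], entropies=[0.9]),
        "lossy_5g":       TaskStats(),                        # not yet visited
    }
    print(pick_task(stats))                                   # -> "lossy_5g" on the first call
```

In this sketch, "bursty_lte" would dominate sampling once all tasks have been visited, since it shows both a reward drop (forgetting risk) and high policy entropy (uncertainty), which matches the abstract's goal of steering training toward environments where further learning pays off.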
Source journal
IEEE Transactions on Broadcasting (Engineering & Technology - Telecommunications)
CiteScore: 9.40
Self-citation rate: 31.10%
Articles published: 79
Review time: 6-12 weeks
Journal description: The Society's Field of Interest is "Devices, equipment, techniques and systems related to broadcast technology, including the production, distribution, transmission, and propagation aspects." In addition to this formal FOI statement, which is used to provide guidance to the Publications Committee in the selection of content, the AdCom has further resolved that "broadcast systems includes all aspects of transmission, propagation, and reception."