{"title":"Efficient Distributed Edge Computing for Dependent Delay-Sensitive Tasks in Multi-Operator Multi-Access Networks","authors":"Alia Asheralieva;Dusit Niyato;Xuetao Wei","doi":"10.1109/TPDS.2024.3468892","DOIUrl":"https://doi.org/10.1109/TPDS.2024.3468892","url":null,"abstract":"We study the problem of distributed computing in the <i>multi-operator multi-access edge computing</i> (MEC) network for <i>dependent tasks</i>. Every task comprises several <i>sub-tasks</i> which are executed based on logical precedence modelled as a <i>directed acyclic graph</i>. In the graph, each vertex is a sub-task and each edge is a precedence constraint, such that a sub-task can only be started after all its preceding sub-tasks are completed. Tasks are executed by MEC servers with the assistance of nearby edge devices, so that the MEC network can be viewed as a <i>distributed</i> “<i>primary-secondary node</i>” system where each MEC server acts as a <i>primary node</i> (PN) deciding on sub-tasks assigned to its <i>secondary nodes</i> (SNs), i.e., nearby edge devices. The PN's decision problem is complex, as its SNs can be associated with other <i>neighboring</i> PNs. In this case, the available processing resources of SNs depend on the sub-task assignment decisions of all neighboring PNs. Since PNs are controlled by different operators, they do not coordinate their decisions, and each PN is uncertain about the sub-task assignments of its neighbors (and, thus, the available resources of its SNs). To address this problem, we propose a novel framework based on a <i>graphical Bayesian game</i>, where PNs play under uncertainty about their neighbors’ decisions. 
We prove that the game has a <i>perfect Bayesian equilibrium</i> (PBE) yielding <i>unique optimal values</i>, and formulate new <i>Bayesian reinforcement learning</i> and <i>Bayesian deep reinforcement learning</i> algorithms enabling each PN to reach the PBE autonomously (without communicating with other PNs).","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"35 12","pages":"2559-2577"},"PeriodicalIF":5.6,"publicationDate":"2024-09-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142595079","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient Schedule Construction for Distributed Execution of Large DNN Models","authors":"Zhiqi Lin;Youshan Miao;Guanbin Xu;Cheng Li;Olli Saarikivi;Saeed Maleki;Fan Yang","doi":"10.1109/TPDS.2024.3466913","DOIUrl":"https://doi.org/10.1109/TPDS.2024.3466913","url":null,"abstract":"Increasingly complex and diverse deep neural network (DNN) models necessitate distributing the execution across multiple devices for training and inference tasks, and also require carefully planned schedules for performance. However, existing practices often rely on predefined schedules that may not fully exploit the benefits of emerging diverse model-aware operator placement strategies. Handcrafting high-efficiency schedules can be challenging due to the large and varying schedule space. This paper presents Tessel, an automated system that searches for efficient schedules for distributed DNN training and inference for diverse operator placement strategies. To reduce search costs, Tessel leverages the insight that the most efficient schedules often exhibit a repetitive pattern (<i>repetend</i>) across different data inputs. This leads to a two-phase approach: repetend construction and schedule completion. By exploring schedules for various operator placement strategies, Tessel significantly improves both training and inference performance. 
Experiments with representative DNN models demonstrate that Tessel achieves up to 5.5× training performance speedup and up to 38% inference latency reduction.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"35 12","pages":"2375-2391"},"PeriodicalIF":5.6,"publicationDate":"2024-09-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142438539","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Two-Timescale Joint Optimization of Task Scheduling and Resource Scaling in Multi-Data Center System Based on Multi-Agent Deep Reinforcement Learning","authors":"Shuangwu Chen;Jiangming Li;Qifeng Yuan;Huasen He;Sen Li;Jian Yang","doi":"10.1109/TPDS.2024.3467212","DOIUrl":"https://doi.org/10.1109/TPDS.2024.3467212","url":null,"abstract":"As a new computing paradigm, multi-data center computing enables service providers to deploy their applications close to the users. However, due to the spatio-temporal changes in workloads, it is challenging to coordinate multiple distributed data centers to provide high-quality services while reducing service operation costs. To address this challenge, this article studies the joint optimization problem of task scheduling and resource scaling in multi-data center systems. Since the task scheduling and the resource scaling are usually performed in different timescales, we decompose the joint optimization problem into two sub-problems and propose a two-timescale optimization framework. The short-timescale task scheduling can promptly relieve the bursty arrivals of computing tasks, and the long-timescale resource scaling can adapt well to the long-term changes in workloads. To address the distributed optimization problem, we propose a two-timescale multi-agent deep reinforcement learning algorithm. In order to characterize the graph-structured states of connected data centers, we develop a directed graph convolutional network based global state representation model. 
The evaluation indicates that the proposed algorithm is able to reduce both the task makespan and the task timeout while maintaining a reasonable cost.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"35 12","pages":"2331-2346"},"PeriodicalIF":5.6,"publicationDate":"2024-09-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142438540","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"VisionAGILE: A Versatile Domain-Specific Accelerator for Computer Vision Tasks","authors":"Bingyi Zhang;Rajgopal Kannan;Carl Busart;Viktor K. Prasanna","doi":"10.1109/TPDS.2024.3466891","DOIUrl":"https://doi.org/10.1109/TPDS.2024.3466891","url":null,"abstract":"The emergence of diverse machine learning (ML) models has led to groundbreaking revolutions in computer vision (CV). These ML models include convolutional neural networks (CNNs), graph neural networks (GNNs), and vision transformers (ViTs). However, existing hardware accelerators designed for CV lack the versatility to support various ML models, potentially limiting their applicability to real-world scenarios. To address this limitation, we introduce VisionAGILE, a domain-specific accelerator designed to be versatile and capable of accommodating a range of ML models, including CNNs, GNNs, and ViTs. VisionAGILE comprises a compiler, a runtime system, and a hardware accelerator. For the hardware accelerator, we develop a novel unified architecture with a flexible data path and memory organization to support the computation primitives in various ML models. Regarding the compiler design, we develop a unified compilation workflow that maps various ML models to the proposed hardware accelerator. The runtime system executes dynamic sparsity exploitation to reduce inference latency and dynamic task scheduling for workload balance. The compiler, the runtime system, and the hardware accelerator work synergistically to support a variety of ML models in CV, enabling low-latency inference. We deploy the hardware accelerator on a state-of-the-art data center FPGA (Xilinx Alveo U250). We evaluate VisionAGILE on diverse ML models for CV, including CNNs, GNNs, hybrid models (comprising both CNN and GNN), and ViTs. 
The experimental results indicate that, compared with state-of-the-art CPU (GPU) implementations, VisionAGILE achieves a speedup of 81.7× (4.8×) in terms of latency. Evaluated on standalone CNNs, GNNs, and ViTs, VisionAGILE demonstrates performance comparable to or higher than that of state-of-the-art CNN accelerators, GNN accelerators, and ViT accelerators, respectively.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"35 12","pages":"2405-2422"},"PeriodicalIF":5.6,"publicationDate":"2024-09-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142447219","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient Inference for Pruned CNN Models on Mobile Devices With Holistic Sparsity Alignment","authors":"Yuyang Jin;Runxin Zhong;Saiqin Long;Jidong Zhai","doi":"10.1109/TPDS.2024.3462092","DOIUrl":"10.1109/TPDS.2024.3462092","url":null,"abstract":"Many artificial intelligence applications based on convolutional neural networks are directly deployed on mobile devices to avoid network unavailability and user privacy leakage. However, the significant increase in model parameter volumes makes it difficult to achieve high-performance convolutional neural network inference on these mobile devices with limited computing power. Weight pruning is one of the main approaches to compress models by reducing model parameters and computational operations, which also introduces irregular sparsity of neural networks, leading to inefficient computation and memory access during inference. This work proposes an end-to-end framework, namely MCPruner, for efficient inference of pruned convolutional neural networks on mobile devices by aligning the sparse patterns with hardware execution features in computation, memory access, and parallelism. It first co-designs pruning methods and code generation optimizations for the alignment of non-zero weight count and vector width, to improve computational efficiency while ensuring accuracy. During the code generation, it applies a sparse pattern-aware format to reduce inefficient memory accesses. Besides, convolution computations are reordered for alignment, and then mapped to parallel threads on accelerated units to achieve high parallelism. 
Experimental results using several commonly used models and datasets on the ARM-based Hikey970 demonstrate that our work outperforms state-of-the-art methods in inference efficiency, with no accuracy degradation.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"35 11","pages":"2208-2223"},"PeriodicalIF":5.6,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142269327","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Freyr+: Harvesting Idle Resources in Serverless Computing via Deep Reinforcement Learning","authors":"Hanfei Yu;Hao Wang;Jian Li;Xu Yuan;Seung-Jong Park","doi":"10.1109/TPDS.2024.3462294","DOIUrl":"10.1109/TPDS.2024.3462294","url":null,"abstract":"Serverless computing has revolutionized online service development and deployment with easy-to-use operations, auto-scaling, fine-grained resource allocation, and pay-as-you-go pricing. However, a gap remains in configuring serverless functions: the actual resource consumption may vary due to function types, dependencies, and input data sizes, thus mismatching the static resource configuration by users. Dynamic resource consumption against static configuration may lead to either poor function execution performance or low utilization. This paper proposes <i>Freyr</i>+, a novel resource manager (RM) that dynamically harvests idle resources from over-provisioned functions to accelerate under-provisioned functions for serverless platforms. <i>Freyr</i>+ monitors each function's resource utilization in real-time and detects the mismatches between user configuration and actual resource consumption. We design deep reinforcement learning (DRL) algorithms with attention-enhanced embedding, incremental learning, and a safeguard mechanism for <i>Freyr</i>+ to harvest idle resources safely and accelerate functions efficiently. We have implemented and deployed a <i>Freyr</i>+ prototype in a 13-node Apache OpenWhisk cluster using AWS EC2. <i>Freyr</i>+ is evaluated in both a large-scale simulation and a real-world testbed. 
Experimental results show that <i>Freyr</i>+ harvests 38% of function invocations’ idle resources and accelerates 39% of invocations using harvested resources. <i>Freyr</i>+ reduces the 99th-percentile function response latency by 26% compared to the baseline RMs.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"35 11","pages":"2254-2269"},"PeriodicalIF":5.6,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142264434","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient Cross-Cloud Partial Reduce With CREW","authors":"Shouxi Luo;Renyi Wang;Ke Li;Huanlai Xing","doi":"10.1109/TPDS.2024.3460185","DOIUrl":"10.1109/TPDS.2024.3460185","url":null,"abstract":"By allowing <i>p</i> out of <i>n</i> workers to conduct <i>all reduce</i> operations among them for a round of synchronization, <i>partial reduce</i>, a promising partially-asynchronous variant of <i>all reduce</i>, has shown its power in alleviating the impacts of stragglers for iterative distributed machine learning (DML). Current <i>partial reduce</i> solutions are mainly designed for intra-cluster DML, in which workers are networked with high-bandwidth LAN links. Yet no prior work has looked into the problem of how to achieve efficient <i>partial reduce</i> for cross-cloud DML, where inter-worker connections have scarce capacity. To fill the gap, in this paper, we propose CREW, a flexible and efficient implementation of <i>partial reduce</i> for cross-cloud DML. At the high level, CREW is built upon the novel design of employing all active workers along with their internal connection capacities to execute the involved communication and computation tasks; and at the low level, CREW employs a suite of algorithms to distribute the tasks among workers in a load-balanced way, and to deal with possible outages of workers/connections and with bandwidth contention. 
Detailed performance studies confirm that CREW not only shortens the execution of each <i>partial reduce</i> operation, greatly outperforming existing communication schemes such as PS, Ring, TopoAdopt, and BLINK, but also significantly accelerates the training of large models, by up to 15× and 9×, respectively, when compared with the all-to-all direct communication scheme and the <i>original partial reduce</i> design.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"35 11","pages":"2224-2238"},"PeriodicalIF":5.6,"publicationDate":"2024-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142264435","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An Evaluation Framework for Dynamic Thermal Management Strategies in 3D MultiProcessor System-on-Chip Co-Design","authors":"Darong Huang;Luis Costero;David Atienza","doi":"10.1109/TPDS.2024.3459414","DOIUrl":"10.1109/TPDS.2024.3459414","url":null,"abstract":"Dynamic thermal management (DTM) has been widely adopted to improve the energy efficiency, reliability, and performance of modern Multi-Processor SoCs (MPSoCs). However, the evolving industry trends and heterogeneous architecture designs have introduced significant challenges in state-of-the-art DTM methods. Specifically, the emergence of heterogeneous design has led to increased localized and non-uniform hotspots, necessitating accurate and responsive DTM strategies. Additionally, the increased number of cores to be managed requires the DTM to optimize and coordinate the whole system. However, existing methodologies fail in both precise thermal modeling in localized hotspots and fast architecture simulation. To tackle these existing challenges, we first introduce the latest version of 3D-ICE 3.1, with a novel non-uniform thermal modeling technique to support customized discretization levels of thermal grids. 3D-ICE 3.1 improves the accuracy of thermal analysis and reduces simulation overhead. Then, in conjunction with an efficient and fast offline application profiling strategy utilizing the architecture simulator gem5-X, we propose a novel DTM evaluation framework. This framework enables us to explore novel DTM methods to optimize the energy efficiency, reliability, and performance of contemporary 3D MPSoCs. The experimental results demonstrate that 3D-ICE 3.1 achieves high accuracy, with only 0.3K mean temperature error. Subsequently, we evaluate various DTM methods and propose a Multi-Agent Reinforcement Learning (MARL) control to address the demanding thermal challenges of 3D MPSoCs. 
Our experimental results show that the proposed DTM method based on MARL can reduce power consumption by 13% while maintaining a similar performance level to the comparison methods.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"35 11","pages":"2161-2176"},"PeriodicalIF":5.6,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142194353","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DeepCAT+: A Low-Cost and Transferrable Online Configuration Auto-Tuning Approach for Big Data Frameworks","authors":"Hui Dou;Yilun Wang;Yiwen Zhang;Pengfei Chen;Zibin Zheng","doi":"10.1109/TPDS.2024.3459889","DOIUrl":"10.1109/TPDS.2024.3459889","url":null,"abstract":"Big data frameworks usually provide a large number of performance-related parameters. Online auto-tuning of these parameters based on deep reinforcement learning (DRL) to achieve better performance has shown advantages over search-based and machine learning-based approaches. Unfortunately, the time cost during the online tuning phase of conventional DRL-based methods is still heavy, especially for Big Data applications. Therefore, in this paper, we propose DeepCAT+, a low-cost and transferrable deep reinforcement learning-based approach to achieve online configuration auto-tuning for Big Data frameworks. To reduce the total online tuning cost and increase the adaptability: 1) DeepCAT+ utilizes the TD3 algorithm instead of DDPG to alleviate value overestimation; 2) DeepCAT+ modifies the conventional experience replay to fully utilize the rare but valuable transitions via a novel reward-driven prioritized experience replay mechanism; 3) DeepCAT+ designs a Twin-Q Optimizer to estimate the execution time of each action without the costly configuration evaluation and optimize the sub-optimal ones to achieve a low-cost exploration-exploitation tradeoff; 4) DeepCAT+ implements an Online Continual Learner module based on Progressive Neural Networks to transfer knowledge from historical tuning experiences. 
Experimental results based on a lab Spark cluster with HiBench benchmark applications show that DeepCAT+ is able to speed up the best execution time by 1.49×, 1.63× and 1.65× on average, respectively, over the baselines, while consuming up to 50.08%, 53.39% and 70.79% less total tuning time. In addition, DeepCAT+ adapts well to the time-varying environment of Big Data frameworks.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"35 11","pages":"2114-2131"},"PeriodicalIF":5.6,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142194351","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Gamora: Learning-Based Buffer-Aware Preloading for Adaptive Short Video Streaming","authors":"Biao Hou;Song Yang;Fan Li;Liehuang Zhu;Lei Jiao;Xu Chen;Xiaoming Fu","doi":"10.1109/TPDS.2024.3456567","DOIUrl":"10.1109/TPDS.2024.3456567","url":null,"abstract":"Emerging short video streaming applications have gained substantial attention. With the rapidly burgeoning demand for short video streaming services, maximizing their Quality of Experience (QoE) is an onerous challenge. Current video preloading algorithms cannot determine video preloading sequence decisions appropriately due to the impact of users’ swipes and bandwidth fluctuations. As a result, it remains unclear how to improve the overall QoE while mitigating bandwidth wastage to optimize short video streaming services. In this article, we devise Gamora, a buffer-aware short video streaming system that provides a high QoE for users. In Gamora, we first propose an unordered preloading algorithm that utilizes a Deep Reinforcement Learning (DRL) algorithm to make video preloading decisions. Then, we further devise an Asymmetric Imitation Learning (AIL) algorithm to guide the DRL-based preloading algorithm, which enables the agent to learn from expert demonstrations for fast convergence. Finally, we implement our proposed short video streaming system prototype and evaluate the performance of Gamora on various real-world network datasets. 
Our results demonstrate that Gamora improves QoE by 28.7%–51.4% compared to state-of-the-art algorithms, while mitigating bandwidth wastage by 40.7%–83.2% without sacrificing video quality.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"35 11","pages":"2132-2146"},"PeriodicalIF":5.6,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142194355","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}