IEEE Transactions on Parallel and Distributed Systems — Latest Articles

Efficient Distributed Edge Computing for Dependent Delay-Sensitive Tasks in Multi-Operator Multi-Access Networks
IF 5.6 · CAS Region 2 · Computer Science
IEEE Transactions on Parallel and Distributed Systems Pub Date : 2024-09-26 DOI: 10.1109/TPDS.2024.3468892
Alia Asheralieva;Dusit Niyato;Xuetao Wei
Abstract: We study the problem of distributed computing in the multi-operator multi-access edge computing (MEC) network for dependent tasks. Every task comprises several sub-tasks executed under logical precedence constraints modelled as a directed acyclic graph: each vertex is a sub-task and each edge a precedence constraint, so that a sub-task can start only after all of its preceding sub-tasks are completed. Tasks are executed by MEC servers with the assistance of nearby edge devices, so the MEC network can be viewed as a distributed "primary-secondary node" system in which each MEC server acts as a primary node (PN) deciding on the sub-tasks assigned to its secondary nodes (SNs), i.e., nearby edge devices. The PN's decision problem is complex because its SNs can be associated with other neighboring PNs, in which case the available processing resources of the SNs depend on the sub-task assignment decisions of all neighboring PNs. Since PNs are controlled by different operators, they do not coordinate their decisions, and each PN is uncertain about the sub-task assignments of its neighbors (and thus about the available resources of its SNs). To address this problem, we propose a novel framework based on a graphical Bayesian game, in which PNs play under uncertainty about their neighbors' decisions. We prove that the game has a perfect Bayesian equilibrium (PBE) yielding unique optimal values, and formulate new Bayesian reinforcement learning and Bayesian deep reinforcement learning algorithms enabling each PN to reach the PBE autonomously, without communicating with other PNs.
Vol. 35, No. 12, pp. 2559-2577.
Citations: 0
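The precedence constraint at the heart of the task model — a sub-task may start only after all of its DAG predecessors complete — can be illustrated with a minimal topological-ordering sketch (illustrative scaffolding only, not code from the paper; the function name and task representation are invented):

```python
from collections import deque

def schedule_subtasks(n, edges):
    """Return an execution order in which every sub-task starts only after
    all of its predecessors in the DAG have completed (Kahn's algorithm)."""
    indeg = [0] * n
    succ = [[] for _ in range(n)]
    for u, v in edges:          # edge u -> v: sub-task v depends on u
        succ[u].append(v)
        indeg[v] += 1
    ready = deque(i for i in range(n) if indeg[i] == 0)
    order = []
    while ready:
        u = ready.popleft()
        order.append(u)
        for v in succ[u]:       # completing u may unblock its successors
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)
    if len(order) != n:
        raise ValueError("precedence graph contains a cycle")
    return order
```

A real PN would additionally choose *which* SN runs each ready sub-task; the sketch only captures the "all predecessors done" admission rule.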
Efficient Schedule Construction for Distributed Execution of Large DNN Models
IEEE Transactions on Parallel and Distributed Systems Pub Date : 2024-09-24 DOI: 10.1109/TPDS.2024.3466913
Zhiqi Lin;Youshan Miao;Guanbin Xu;Cheng Li;Olli Saarikivi;Saeed Maleki;Fan Yang
Abstract: Increasingly complex and diverse deep neural network (DNN) models necessitate distributing execution across multiple devices for training and inference, and require carefully planned schedules for performance. However, existing practices often rely on predefined schedules that may not fully exploit the benefits of emerging diverse model-aware operator placement strategies, and handcrafting high-efficiency schedules is challenging due to the large and varying schedule space. This paper presents Tessel, an automated system that searches for efficient schedules for distributed DNN training and inference under diverse operator placement strategies. To reduce search costs, Tessel leverages the insight that the most efficient schedules often exhibit a repetitive pattern (repetend) across different data inputs, which leads to a two-phase approach: repetend construction and schedule completion. By exploring schedules for various operator placement strategies, Tessel significantly improves both training and inference performance. Experiments with representative DNN models demonstrate that Tessel achieves up to 5.5× training speedup and up to 38% inference latency reduction.
Vol. 35, No. 12, pp. 2375-2391.
Citations: 0
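Tessel's key insight — that a steady-state schedule repeats a unit (the repetend) across data inputs — can be illustrated by a toy routine that recovers the shortest repeating unit of a schedule trace (purely illustrative; this is not the paper's search algorithm):

```python
def find_repetend(schedule):
    """Return the shortest repeating unit of a schedule trace, if one exists.
    E.g. a steady-state pipeline trace F B F B ... repeats with period 2."""
    n = len(schedule)
    for period in range(1, n + 1):
        if n % period == 0 and all(
            schedule[i] == schedule[i % period] for i in range(n)
        ):
            return schedule[:period]
    return schedule  # aperiodic: the whole trace is its own unit
```

Once a repetend is fixed, only its (much smaller) interior needs to be searched; the full schedule is completed by unrolling it plus warm-up/cool-down phases.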
Two-Timescale Joint Optimization of Task Scheduling and Resource Scaling in Multi-Data Center System Based on Multi-Agent Deep Reinforcement Learning
IEEE Transactions on Parallel and Distributed Systems Pub Date : 2024-09-24 DOI: 10.1109/TPDS.2024.3467212
Shuangwu Chen;Jiangming Li;Qifeng Yuan;Huasen He;Sen Li;Jian Yang
Abstract: As a new computing paradigm, multi-data center computing enables service providers to deploy their applications close to users. However, due to spatio-temporal changes in workloads, it is challenging to coordinate multiple distributed data centers to provide high-quality services while reducing service operation costs. To address this challenge, this article studies the joint optimization of task scheduling and resource scaling in multi-data center systems. Since task scheduling and resource scaling are usually performed on different timescales, we decompose the joint optimization problem into two sub-problems and propose a two-timescale optimization framework: short-timescale task scheduling promptly relieves bursty arrivals of computing tasks, while long-timescale resource scaling adapts to long-term changes in workloads. To solve the distributed optimization problem, we propose a two-timescale multi-agent deep reinforcement learning algorithm, and develop a directed graph convolutional network based global state representation model to characterize the graph-structured states of connected data centers. The evaluation indicates that the proposed algorithm reduces both task makespan and task timeouts while maintaining a reasonable cost.
Vol. 35, No. 12, pp. 2331-2346.
Citations: 0
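The two-timescale decomposition — a scheduling decision every step and a scaling decision only every K steps — can be sketched as a pair of nested control loops (an illustrative skeleton with invented names; the paper's agents are learned, not callbacks):

```python
def two_timescale_control(steps, scale_every, schedule_fn, scale_fn, state):
    """Run fast task-scheduling every step and slow resource-scaling every
    `scale_every` steps, mirroring the two-timescale decomposition."""
    for t in range(steps):
        state = schedule_fn(state, t)           # short timescale: place tasks
        if t % scale_every == scale_every - 1:
            state = scale_fn(state, t)          # long timescale: resize capacity
    return state
```

In the paper each loop body would be a multi-agent DRL policy; the skeleton only shows how the two decision frequencies interleave.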
VisionAGILE: A Versatile Domain-Specific Accelerator for Computer Vision Tasks
IEEE Transactions on Parallel and Distributed Systems Pub Date : 2024-09-24 DOI: 10.1109/TPDS.2024.3466891
Bingyi Zhang;Rajgopal Kannan;Carl Busart;Viktor K. Prasanna
Abstract: The emergence of diverse machine learning (ML) models has led to groundbreaking revolutions in computer vision (CV). These models include convolutional neural networks (CNNs), graph neural networks (GNNs), and vision transformers (ViTs). However, existing hardware accelerators designed for CV lack the versatility to support various ML models, potentially limiting their applicability to real-world scenarios. To address this limitation, we introduce VisionAGILE, a domain-specific accelerator designed to be versatile enough to accommodate a range of ML models, including CNNs, GNNs, and ViTs. VisionAGILE comprises a compiler, a runtime system, and a hardware accelerator. For the hardware accelerator, we develop a novel unified architecture with a flexible data path and memory organization to support the computation primitives of various ML models. For the compiler, we develop a unified compilation workflow that maps various ML models to the proposed hardware accelerator. The runtime system performs dynamic sparsity exploitation to reduce inference latency and dynamic task scheduling for workload balance. The compiler, runtime system, and hardware accelerator work synergistically to support a variety of ML models in CV, enabling low-latency inference. We deploy the hardware accelerator on a state-of-the-art data center FPGA (Xilinx Alveo U250) and evaluate VisionAGILE on diverse ML models for CV, including CNNs, GNNs, hybrid models (comprising both CNN and GNN), and ViTs. Experimental results indicate that, compared with state-of-the-art CPU (GPU) implementations, VisionAGILE achieves a latency speedup of 81.7× (4.8×). Evaluated on standalone CNNs, GNNs, and ViTs, VisionAGILE demonstrates comparable or higher performance than state-of-the-art CNN, GNN, and ViT accelerators, respectively.
Vol. 35, No. 12, pp. 2405-2422.
Citations: 0
Efficient Inference for Pruned CNN Models on Mobile Devices With Holistic Sparsity Alignment
IEEE Transactions on Parallel and Distributed Systems Pub Date : 2024-09-17 DOI: 10.1109/TPDS.2024.3462092
Yuyang Jin;Runxin Zhong;Saiqin Long;Jidong Zhai
Abstract: Many artificial intelligence applications based on convolutional neural networks (CNNs) are deployed directly on mobile devices to avoid network unavailability and user privacy leakage. However, the significant growth in model parameter volume makes it difficult to achieve high-performance CNN inference on mobile devices with limited computing power. Weight pruning is one of the main approaches to compressing models by reducing parameters and computational operations, but it introduces irregular sparsity, leading to inefficient computation and memory access during inference. This work proposes an end-to-end framework, MCPruner, for efficient inference of pruned CNNs on mobile devices by aligning sparse patterns with hardware execution features in computation, memory access, and parallelism. It co-designs pruning methods and code-generation optimizations to align the non-zero weight count with the vector width, improving computational efficiency while preserving accuracy. During code generation, it applies a sparse pattern-aware format to reduce inefficient memory accesses. In addition, convolution computations are reordered for alignment and mapped to parallel threads on accelerated units to achieve high parallelism. Experimental results with several commonly used models and datasets on the ARM-based HiKey 970 demonstrate that our work outperforms state-of-the-art methods in inference efficiency with no accuracy degradation.
Vol. 35, No. 11, pp. 2208-2223.
Citations: 0
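The alignment idea — forcing every vector-width group of weights to keep the same number of non-zeros so SIMD lanes stay uniformly busy — resembles N:M structured sparsity and can be sketched as follows (illustrative only; this is not MCPruner's actual pruning method, and the names are invented):

```python
def prune_aligned(weights, vector_width, keep):
    """Keep the `keep` largest-magnitude weights in every `vector_width`-sized
    group, so the non-zero count per vector is uniform (N:M-style sparsity)."""
    pruned = list(weights)
    for start in range(0, len(pruned), vector_width):
        group = list(range(start, min(start + vector_width, len(pruned))))
        # rank the group's indices by weight magnitude, largest first
        group.sort(key=lambda i: abs(pruned[i]), reverse=True)
        for i in group[keep:]:
            pruned[i] = 0.0        # zero out everything below the cut
    return pruned
```

Because every group then carries exactly `keep` non-zeros, a generated kernel can process one group per vector instruction without per-group bookkeeping.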
Freyr+: Harvesting Idle Resources in Serverless Computing via Deep Reinforcement Learning
IEEE Transactions on Parallel and Distributed Systems Pub Date : 2024-09-17 DOI: 10.1109/TPDS.2024.3462294
Hanfei Yu;Hao Wang;Jian Li;Xu Yuan;Seung-Jong Park
Abstract: Serverless computing has revolutionized online service development and deployment with easy-to-use operations, auto-scaling, fine-grained resource allocation, and pay-as-you-go pricing. However, a gap remains in configuring serverless functions: actual resource consumption may vary with function type, dependencies, and input data size, and thus mismatch the user's static resource configuration. Dynamic resource consumption against a static configuration leads to either poor function execution performance or low utilization. This paper proposes Freyr+, a novel resource manager (RM) that dynamically harvests idle resources from over-provisioned functions to accelerate under-provisioned functions on serverless platforms. Freyr+ monitors each function's resource utilization in real time and detects mismatches between user configuration and actual resource consumption. We design deep reinforcement learning (DRL) algorithms with attention-enhanced embedding, incremental learning, and a safeguard mechanism so that Freyr+ harvests idle resources safely and accelerates functions efficiently. We have implemented and deployed a Freyr+ prototype in a 13-node Apache OpenWhisk cluster on AWS EC2, and evaluated it in both large-scale simulation and a real-world testbed. Experimental results show that Freyr+ harvests idle resources from 38% of function invocations and accelerates 39% of invocations using the harvested resources, reducing the 99th-percentile function response latency by 26% compared to the baseline RMs.
Vol. 35, No. 11, pp. 2254-2269.
Citations: 0
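The core harvesting loop — reclaim a safeguarded fraction of slack from over-provisioned functions and grant it to under-provisioned ones, keeping the total fixed — can be sketched without the DRL machinery (a toy rule-based model with invented names; Freyr+'s real decisions are learned, not rule-based):

```python
def harvest(configs, usages, safeguard=0.9):
    """Move a safeguarded fraction of idle resources from over-provisioned
    functions to under-provisioned ones; total allocation is conserved."""
    needy = [f for f in configs if usages[f] >= configs[f]]
    if not needy:
        return dict(configs)          # nothing to accelerate
    alloc = dict(configs)
    pool = 0.0
    for f in configs:
        slack = (configs[f] - usages[f]) * safeguard
        if slack > 0:
            alloc[f] -= slack         # harvest from the donor
            pool += slack
    share = pool / len(needy)
    for f in needy:
        alloc[f] += share             # accelerate the under-provisioned
    return alloc
```

The `safeguard` factor plays the role of the paper's safety margin: a donor never loses all of its measured slack, so a usage spike does not immediately starve it.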
Efficient Cross-Cloud Partial Reduce With CREW
IEEE Transactions on Parallel and Distributed Systems Pub Date : 2024-09-13 DOI: 10.1109/TPDS.2024.3460185
Shouxi Luo;Renyi Wang;Ke Li;Huanlai Xing
Abstract: By allowing p out of n workers to conduct all-reduce operations among themselves for a round of synchronization, partial reduce, a promising partially-asynchronous variant of all-reduce, has shown its power in alleviating the impact of stragglers in iterative distributed machine learning (DML). Current partial-reduce solutions are mainly designed for intra-cluster DML, where workers are networked over high-bandwidth LAN links; no prior work has examined how to achieve efficient partial reduce for cross-cloud DML, where inter-worker connections have scarce capacity. To fill this gap, this paper proposes CREW, a flexible and efficient implementation of partial reduce for cross-cloud DML. At the high level, CREW employs all active workers, along with their internal connection capacities, to execute the communication and computation tasks involved; at the low level, it uses a suite of algorithms to distribute tasks among workers in a load-balanced way and to handle possible outages of workers and connections as well as bandwidth contention. Detailed performance studies confirm that CREW not only shortens each partial-reduce operation, greatly outperforming existing communication schemes such as PS, Ring, TopoAdopt, and BLINK, but also significantly accelerates the training of large models: up to 15× and 9× over the all-to-all direct communication scheme and the original partial-reduce design, respectively.
Vol. 35, No. 11, pp. 2224-2238.
Citations: 0
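The partial-reduce primitive itself — only the first p of n workers synchronize in a round, while stragglers keep their local state — can be modeled in a few lines (an illustrative toy, not CREW's cross-cloud implementation; the worker records are invented):

```python
def partial_reduce(updates, p):
    """Synchronize only the first p workers to finish the round; stragglers
    keep their local gradients (the partial-reduce relaxation of all-reduce)."""
    fast = sorted(updates, key=lambda w: w["finish_time"])[:p]
    mean = sum(w["grad"] for w in fast) / p   # all-reduce among the p fastest
    for w in fast:
        w["grad"] = mean
    return updates
```

Because a round completes as soon as p workers arrive, the slowest n-p workers no longer gate each iteration, which is exactly the straggler-mitigation property the abstract describes.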
An Evaluation Framework for Dynamic Thermal Management Strategies in 3D MultiProcessor System-on-Chip Co-Design
IEEE Transactions on Parallel and Distributed Systems Pub Date : 2024-09-12 DOI: 10.1109/TPDS.2024.3459414
Darong Huang;Luis Costero;David Atienza
Abstract: Dynamic thermal management (DTM) has been widely adopted to improve the energy efficiency, reliability, and performance of modern multi-processor SoCs (MPSoCs). However, evolving industry trends and heterogeneous architecture designs pose significant challenges to state-of-the-art DTM methods. Specifically, heterogeneous designs produce more localized and non-uniform hotspots, necessitating accurate and responsive DTM strategies, while the growing number of cores requires DTM to optimize and coordinate the whole system. Existing methodologies fail at both precise thermal modeling of localized hotspots and fast architecture simulation. To tackle these challenges, we first introduce the latest version of 3D-ICE, 3.1, whose novel non-uniform thermal modeling technique supports customized discretization levels of the thermal grid, improving the accuracy of thermal analysis while reducing simulation overhead. Then, in conjunction with an efficient and fast offline application-profiling strategy based on the architecture simulator gem5-X, we propose a novel DTM evaluation framework. This framework enables us to explore new DTM methods to optimize the energy efficiency, reliability, and performance of contemporary 3D MPSoCs. Experimental results demonstrate that 3D-ICE 3.1 achieves high accuracy, with only 0.3 K mean temperature error. We then evaluate various DTM methods and propose a Multi-Agent Reinforcement Learning (MARL) controller to address the demanding thermal challenges of 3D MPSoCs. Our experiments show that the proposed MARL-based DTM reduces power consumption by 13% while maintaining a performance level similar to the comparison methods.
Vol. 35, No. 11, pp. 2161-2176.
Citations: 0
DeepCAT+: A Low-Cost and Transferrable Online Configuration Auto-Tuning Approach for Big Data Frameworks
IEEE Transactions on Parallel and Distributed Systems Pub Date : 2024-09-12 DOI: 10.1109/TPDS.2024.3459889
Hui Dou;Yilun Wang;Yiwen Zhang;Pengfei Chen;Zibin Zheng
Abstract: Big Data frameworks usually expose a large number of performance-related parameters. Online auto-tuning of these parameters based on deep reinforcement learning (DRL) has shown advantages over search-based and machine-learning-based approaches, but the time cost of the online tuning phase in conventional DRL-based methods remains heavy, especially for Big Data applications. This paper proposes DeepCAT+, a low-cost and transferrable DRL-based approach for online configuration auto-tuning of Big Data frameworks. To reduce the total online tuning cost and increase adaptability: 1) DeepCAT+ uses the TD3 algorithm instead of DDPG to alleviate value overestimation; 2) it modifies conventional experience replay to fully exploit rare but valuable transitions via a novel reward-driven prioritized experience replay mechanism; 3) it designs a Twin-Q Optimizer to estimate the execution time of each action without costly configuration evaluations and to optimize sub-optimal actions, achieving a low-cost exploration-exploitation tradeoff; and 4) it implements an Online Continual Learner module based on progressive neural networks to transfer knowledge from historical tuning experiences. Experimental results on a lab Spark cluster with HiBench benchmark applications show that DeepCAT+ speeds up the best execution time by average factors of 1.49×, 1.63×, and 1.65× over the baselines, while consuming up to 50.08%, 53.39%, and 70.79% less total tuning time, respectively. DeepCAT+ also adapts well to the time-varying environments of Big Data frameworks.
Vol. 35, No. 11, pp. 2114-2131.
Citations: 0
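The reward-driven prioritized replay idea — weighting transition sampling by reward so rare, valuable transitions are replayed more often than uniform replay would — can be sketched as follows (a simplified illustration with an invented buffer layout; DeepCAT+'s actual mechanism is more elaborate):

```python
import random

def sample_prioritized(buffer, k, rng=None):
    """Sample k transitions with probability proportional to a reward-driven
    priority, so high-reward transitions are replayed more often."""
    rng = rng or random.Random()
    lo = min(t["reward"] for t in buffer)
    # shift rewards to be positive; the epsilon keeps every weight non-zero
    weights = [t["reward"] - lo + 1e-6 for t in buffer]
    return rng.choices(buffer, weights=weights, k=k)
```

Uniform replay would draw each transition with probability 1/len(buffer); here the best transition in a mostly-poor buffer dominates the draw, which is the effect the abstract attributes to reward-driven prioritization.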
Gamora: Learning-Based Buffer-Aware Preloading for Adaptive Short Video Streaming
IEEE Transactions on Parallel and Distributed Systems Pub Date : 2024-09-09 DOI: 10.1109/TPDS.2024.3456567
Biao Hou;Song Yang;Fan Li;Liehuang Zhu;Lei Jiao;Xu Chen;Xiaoming Fu
Abstract: Emerging short video streaming applications have gained substantial attention. With the rapidly burgeoning demand for short video streaming services, maximizing their Quality of Experience (QoE) is an onerous challenge. Current video preloading algorithms cannot make appropriate preloading-sequence decisions under users' swipes and bandwidth fluctuations, so it remains unclear how to improve overall QoE while mitigating bandwidth waste in short video streaming services. In this article, we devise Gamora, a buffer-aware short video streaming system that provides users with high QoE. In Gamora, we first propose an unordered preloading algorithm that uses deep reinforcement learning (DRL) to make video preloading decisions. We then devise an Asymmetric Imitation Learning (AIL) algorithm to guide the DRL-based preloader, enabling the agent to learn from expert demonstrations for fast convergence. Finally, we implement a prototype of the proposed system and evaluate Gamora on various real-world network datasets. Our results demonstrate that Gamora improves QoE by 28.7%-51.4% over state-of-the-art algorithms while mitigating bandwidth waste by 40.7%-83.2% without sacrificing video quality.
Vol. 35, No. 11, pp. 2132-2146.
Citations: 0