Reducing Makespans of DAG Scheduling through Interleaving Overlapping Resource Utilization

2020 IEEE 17th International Conference on Mobile Ad Hoc and Sensor Systems (MASS). Pub Date: 2020-12-01. DOI: 10.1109/MASS50613.2020.00055

Yubin Duan, Ning Wang, Jie Wu


Citations: 4

Abstract

As data center clusters need to process quintillions of bytes of data per day, scheduling jobs efficiently to improve resource utilization has become a critical problem. However, a data analysis job usually contains multiple stages with dependencies among them, which makes scheduling challenging. These stages are modeled as a Directed Acyclic Graph (DAG), and the general DAG scheduling problem is NP-hard. In this paper, we observe that in some parallel computing frameworks, such as Spark, the execution of each stage can be divided into multiple phases that use different resources, and that interleaving the use of different resources in a pipelined manner can improve resource utilization. Based on this observation, we propose to minimize the job makespan by exploiting resource pipelining. We first theoretically analyze scheduling for perfectly parallel stages. In this case, our scheduling problem is equivalent to a DAG shop problem, which is NP-hard. We propose a contention-free scheduler and analyze its approximation properties. Stages of real-world jobs, however, are usually not perfectly parallel. For general jobs, we propose a reinforcement learning (RL) based scheduler that adaptively adjusts resource contention. We evaluate our contention-free and RL-based schedulers on a Spark cluster deployed on Amazon EC2. Experiments on real-world and synthetic datasets show that our RL-based scheduler can improve CPU and network utilization by 33.0% and 29.7%, respectively.
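The pipelining observation in the abstract can be illustrated with a toy calculation. The sketch below is not the paper's scheduler. It assumes each stage has exactly two phases (a network fetch followed by a CPU computation), that the stages are independent of each other, and that each resource serves one stage at a time. The function names and the stage durations are hypothetical.

```python
from typing import List, Tuple

# Each stage is (network_time, cpu_time): it first fetches its input over the
# network, then computes on the CPU. The durations used below are made up.
Stage = Tuple[float, float]


def serial_makespan(stages: List[Stage]) -> float:
    """No overlap: a stage starts only after the previous one fully finishes."""
    return sum(net + cpu for net, cpu in stages)


def pipelined_makespan(stages: List[Stage]) -> float:
    """Overlap: while one stage computes, the next one may already be fetching."""
    net_free = 0.0  # time at which the network becomes idle
    cpu_free = 0.0  # time at which the CPU becomes idle
    for net, cpu in stages:
        net_free += net  # fetches run back to back on the network
        # A stage's compute phase waits for its own fetch and for the CPU.
        cpu_free = max(cpu_free, net_free) + cpu
    return cpu_free


stages = [(3.0, 2.0), (1.0, 4.0), (2.0, 2.0)]
print(serial_makespan(stages))     # 14.0
print(pipelined_makespan(stages))  # 11.0
```

In this toy instance the CPU is busy for 8 of 11 time units under pipelining, compared with 8 of 14 when stages run strictly one after another. Real jobs add DAG dependencies and resource contention, which this sketch deliberately ignores.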


