Reducing Makespans of DAG Scheduling through Interleaving Overlapping Resource Utilization

Yubin Duan, Ning Wang, Jie Wu
2020 IEEE 17th International Conference on Mobile Ad Hoc and Sensor Systems (MASS), published 2020-12-01
DOI: 10.1109/MASS50613.2020.00055 (https://doi.org/10.1109/MASS50613.2020.00055)
Citations: 4

Abstract

As data center clusters need to process quintillions of bytes of data per day, efficiently scheduling jobs to improve resource utilization has become a critical problem. However, a data analysis job usually contains multiple stages with dependencies among them, which makes scheduling challenging. These stages are modeled as Directed Acyclic Graphs (DAGs), and the general DAG scheduling problem is NP-hard. In this paper, we notice that in some parallel computing frameworks such as Spark, the execution of each stage can be divided into multiple phases that use different resources. We observe that interleaving the use of different resources in a pipelined manner can improve resource utilization. Based on this observation, we propose to minimize the job makespan by exploiting resource pipelining. We first theoretically analyze scheduling for perfectly parallel stages. In this case, our scheduling problem is equivalent to a DAG shop problem, which is NP-hard. We propose a contention-free scheduler and analyze its approximation properties. Stages of real-world jobs are usually not perfectly parallel. For general jobs, we propose a reinforcement learning (RL) based scheduler that adaptively adjusts resource contention. We evaluate our contention-free and RL-based schedulers on a Spark cluster deployed on Amazon EC2. Experiments on real-world and synthetic datasets show that our RL-based scheduler can improve CPU and network utilization by 33.0% and 29.7%, respectively.
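The pipelining observation can be illustrated with a toy model. The sketch below is not the paper's contention-free or RL-based scheduler; it only assumes that each stage has one network phase followed by one CPU phase, that the stages are independent (no DAG dependencies), that they run in a fixed order, and that each resource serves one stage at a time. The names `Stage`, `sequential_makespan`, and `pipelined_makespan`, as well as the phase durations, are hypothetical.

```python
from typing import List, Tuple

# Hypothetical stage model: (network_time, cpu_time) in arbitrary time units.
Stage = Tuple[float, float]


def sequential_makespan(stages: List[Stage]) -> float:
    """Run stages one after another; one resource idles while the other works."""
    return sum(net + cpu for net, cpu in stages)


def pipelined_makespan(stages: List[Stage]) -> float:
    """Overlap the network phase of a later stage with the CPU phase of an
    earlier one. Assumes each resource handles one stage at a time
    (no contention) and stages are processed in list order."""
    net_free = 0.0  # time at which the network becomes idle
    cpu_free = 0.0  # time at which the CPU becomes idle
    for net, cpu in stages:
        net_free += net  # network phases run back to back
        # The CPU phase starts once its input has arrived and the CPU is idle.
        cpu_free = max(cpu_free, net_free) + cpu
    return cpu_free


if __name__ == "__main__":
    stages = [(3.0, 2.0), (2.0, 4.0), (1.0, 3.0)]  # made-up durations
    print(sequential_makespan(stages))  # 15.0
    print(pipelined_makespan(stages))   # 12.0
```

In this made-up instance the network finishes all transfers at time 6 while the CPU is still busy, so overlapping the two resources shortens the makespan from 15 to 12 time units. Real stages have dependencies and imperfect parallelism, which is the case the abstract says the RL-based scheduler targets.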