Reducing Makespans of DAG Scheduling through Interleaving Overlapping Resource Utilization
Yubin Duan, Ning Wang, Jie Wu
2020 IEEE 17th International Conference on Mobile Ad Hoc and Sensor Systems (MASS), December 2020
DOI: 10.1109/MASS50613.2020.00055
As data center clusters process quintillions of bytes of data per day, scheduling jobs efficiently to improve resource utilization has become a critical problem. However, a data analysis job usually contains multiple stages with dependencies among them, which makes scheduling challenging. A job's stages and their dependencies are modeled as a Directed Acyclic Graph (DAG), and the general DAG scheduling problem is NP-hard. In this paper, we note that in some parallel computing frameworks, such as Spark, the execution of each stage can be divided into multiple phases that use different resources. We observe that interleaving the use of different resources in a pipelined manner can improve resource utilization. Based on this observation, we propose to minimize job makespan by exploiting this resource pipeline. We first give a theoretical analysis of scheduling for perfectly parallel stages. In this case, our scheduling problem is equivalent to the DAG shop problem, which is NP-hard. We propose a contention-free scheduler and analyze its approximation properties. Stages of real-world jobs, however, are usually not perfectly parallel. For general jobs, we propose a scheduler based on reinforcement learning (RL) that adaptively adjusts resource contention. We evaluate our contention-free and RL-based schedulers on a Spark cluster deployed on Amazon EC2. Experiments on real-world and synthetic datasets show that our RL-based scheduler can improve CPU and network utilization by 33.0% and 29.7%, respectively.
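To make the resource-pipelining idea concrete, here is a minimal Python sketch under simplifying assumptions that are ours, not the paper's. Each stage has exactly one network phase followed by one CPU phase. Each resource serves one phase at a time (one possible reading of "contention-free"). A stage starts only after all of its parents have finished. The greedy ordering breaks ties with Johnson's rule for two-machine flow shops. This is a classical heuristic swapped in for illustration and is not the authors' contention-free or RL-based scheduler. The `Stage` class, the function names, and the durations are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Stage:
    """Hypothetical stage model (an assumption): a network phase followed by a CPU phase."""
    name: str
    net: float  # duration of the network phase (e.g. fetching shuffle data)
    cpu: float  # duration of the CPU (compute) phase
    parents: list[str] = field(default_factory=list)


def serial_makespan(stages: list[Stage]) -> float:
    """Baseline: run one stage at a time, so no phases overlap."""
    return sum(s.net + s.cpu for s in stages)


def johnson_key(s: Stage) -> tuple[int, float]:
    """Johnson's rule for a two-machine flow shop: stages with net < cpu go first
    (ascending net), the rest go last (descending cpu)."""
    return (0, s.net) if s.net < s.cpu else (1, -s.cpu)


def pipelined_schedule(stages: list[Stage]) -> tuple[float, list[tuple[str, float, float]]]:
    """Greedy list schedule on two resources that each serve one phase at a time.

    A stage starts its network phase only after every parent has finished its CPU
    phase. Among stages whose parents are all placed, pick the one that can start
    earliest on the network, breaking ties with Johnson's rule, then by name.
    Returns (makespan, [(name, net_start, cpu_end), ...]).
    """
    by_name = {s.name: s for s in stages}
    cpu_end: dict[str, float] = {}
    net_free = cpu_free = 0.0
    timeline: list[tuple[str, float, float]] = []
    remaining = set(by_name)
    while remaining:
        ready = [by_name[n] for n in remaining
                 if all(p in cpu_end for p in by_name[n].parents)]
        if not ready:
            raise ValueError("dependency cycle or unknown parent stage")

        def earliest_net_start(st: Stage) -> float:
            # The network must be idle and all parents' CPU phases must be done.
            return max([net_free] + [cpu_end[p] for p in st.parents])

        s = min(ready, key=lambda st: (earliest_net_start(st), johnson_key(st), st.name))
        net_start = earliest_net_start(s)
        net_end = net_start + s.net
        # The CPU phase needs its own input data and an idle CPU.
        cpu_end[s.name] = max(cpu_free, net_end) + s.cpu
        net_free, cpu_free = net_end, cpu_end[s.name]
        timeline.append((s.name, net_start, cpu_end[s.name]))
        remaining.remove(s.name)
    return max(cpu_end.values(), default=0.0), timeline


# Toy job (made-up durations): D depends on B and C, and C depends on A.
stages = [
    Stage("A", net=2, cpu=4),
    Stage("B", net=3, cpu=1),
    Stage("C", net=1, cpu=3, parents=["A"]),
    Stage("D", net=2, cpu=2, parents=["B", "C"]),
]
print(serial_makespan(stages))          # 18
print(pipelined_schedule(stages)[0])    # 14.0
```

In this toy instance, B's network phase (time 2 to 5) overlaps A's CPU phase (time 2 to 6), which cuts the makespan from 18 to 14 time units. The numbers only illustrate the mechanism and are unrelated to the utilization gains reported in the paper.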