Two Sides of a Coin: Optimizing the Schedule of MapReduce Jobs to Minimize Their Makespan and Improve Cluster Performance

2012 IEEE 20th International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems. Pub Date: 2012-06-21. DOI: 10.1109/MASCOTS.2012.12

Abhishek Verma, L. Cherkasova, R. Campbell


Citations: 109

Abstract

Large-scale MapReduce clusters that routinely process petabytes of unstructured and semi-structured data represent a new entity in the changing landscape of clouds. A key challenge is to increase the utilization of these MapReduce clusters. In this work, we consider a subset of the production workload that consists of MapReduce jobs with no dependencies. We observe that the order in which these jobs are executed can have a significant impact on their overall completion time and the cluster resource utilization. Our goal is to automate the design of a job schedule that minimizes the completion time (makespan) of such a set of MapReduce jobs. We offer a novel abstraction framework and a heuristic, called BalancedPools, that efficiently utilizes performance properties of MapReduce jobs in a given workload for constructing an optimized job schedule. Simulations performed over a realistic workload demonstrate that 15%-38% makespan improvements are achievable by simply processing the jobs in the right order.
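To illustrate why execution order alone can change the makespan, here is a minimal sketch under a deliberately simplified model. It is not the BalancedPools heuristic, whose details the abstract does not give. The assumptions are: each job is reduced to a (map time, reduce time) pair, the map stage of the whole cluster runs one job at a time, the reduce stage does the same, and a job's reduce stage can start only after its own map stage and the previous job's reduce stage have finished. Under those assumptions the problem is a classic two-stage flow shop, for which Johnson's rule gives an optimal order. The job names and durations below are made-up illustrative values.

```python
from typing import List, Tuple

# (name, map_time, reduce_time); all values are hypothetical.
Job = Tuple[str, int, int]


def makespan(order: List[Job]) -> int:
    """Completion time of the last reduce stage in a two-stage pipeline."""
    map_end = 0      # time at which the map stage becomes free
    reduce_end = 0   # time at which the reduce stage becomes free
    for _name, map_time, reduce_time in order:
        map_end += map_time
        # Reduce starts once this job's map is done and the stage is free.
        reduce_end = max(reduce_end, map_end) + reduce_time
    return reduce_end


def johnson_order(jobs: List[Job]) -> List[Job]:
    """Order jobs by Johnson's rule for a two-stage flow shop."""
    # Jobs whose map stage is the shorter one go first, shortest map first.
    head = sorted((j for j in jobs if j[1] <= j[2]), key=lambda j: j[1])
    # The remaining jobs go last, longest reduce first.
    tail = sorted((j for j in jobs if j[1] > j[2]), key=lambda j: j[2], reverse=True)
    return head + tail


jobs: List[Job] = [
    ("J1", 4, 5),
    ("J2", 1, 4),
    ("J3", 30, 4),
    ("J4", 6, 30),
    ("J5", 2, 3),
]

if __name__ == "__main__":
    print("submission order:", makespan(jobs))                 # 74
    print("Johnson order:   ", makespan(johnson_order(jobs)))  # 47
```

With these values the same five jobs finish at time 74 in submission order and at time 47 when reordered, because the reordered schedule keeps the reduce stage busy while long map stages run. Real clusters run many tasks per stage in parallel, so this sketch only conveys the intuition behind order-sensitive makespan.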

