A Hybrid Scheduling Algorithm for Data Intensive Workloads in a MapReduce Environment

P. Nguyen, T. Simon, M. Halem, David Chapman, Quang-Trai Le
{"title":"A Hybrid Scheduling Algorithm for Data Intensive Workloads in a MapReduce Environment","authors":"P. Nguyen, T. Simon, M. Halem, David Chapman, Quang-Trai Le","doi":"10.1109/UCC.2012.32","DOIUrl":null,"url":null,"abstract":"The specific choice of workload task schedulers for Hadoop MapReduce applications can have a dramatic effect on job workload latency. The Hadoop Fair Scheduler (FairS) assigns resources to jobs such that all jobs get, on average, an equal share of resources over time. Thus, it addresses the problem with a FIFO scheduler when short jobs have to wait for long running jobs to complete. We show that even for the FairS, jobs are still forced to wait significantly when the MapReduce system assigns equal sharing of resources due to dependencies between Map, Shuffle, Sort, Reduce phases. We propose a Hybrid Scheduler (HybS) algorithm based on dynamic priority in order to reduce the latency for variable length concurrent jobs, while maintaining data locality. The dynamic priorities can accommodate multiple task lengths, job sizes, and job waiting times by applying a greedy fractional knapsack algorithm for job task processor assignment. The estimated runtime of Map and Reduce tasks are provided to the HybS dynamic priorities from the historical Hadoop log files. In addition to dynamic priority, we implement a reordering of task processor assignment to account for data availability to automatically maintain the benefits of data locality in this environment. We evaluate our approach by running concurrent workloads consisting of the Word-count and Terasort benchmarks, and a satellite scientific data processing workload and developing a simulator. Our evaluation shows the HybS system improves the average response time for the workloads approximately 2.1x faster over the Hadoop FairS with a standard deviation of 1.4x.","PeriodicalId":122639,"journal":{"name":"2012 IEEE Fifth International Conference on Utility and Cloud Computing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2012-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"43","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2012 IEEE Fifth International Conference on Utility and Cloud Computing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/UCC.2012.32","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 43

Abstract

The specific choice of workload task scheduler for Hadoop MapReduce applications can have a dramatic effect on job workload latency. The Hadoop Fair Scheduler (FairS) assigns resources to jobs such that all jobs get, on average, an equal share of resources over time. It thus addresses the problem of a FIFO scheduler, in which short jobs must wait for long-running jobs to complete. We show that even under the FairS, jobs are still forced to wait significantly when the MapReduce system assigns equal shares of resources, because of the dependencies between the Map, Shuffle, Sort, and Reduce phases. We propose a Hybrid Scheduler (HybS) algorithm based on dynamic priority to reduce the latency of variable-length concurrent jobs while maintaining data locality. The dynamic priorities can accommodate multiple task lengths, job sizes, and job waiting times by applying a greedy fractional knapsack algorithm for job task processor assignment. The estimated runtimes of Map and Reduce tasks are provided to the HybS dynamic priorities from historical Hadoop log files. In addition to dynamic priority, we implement a reordering of task processor assignment that accounts for data availability, automatically maintaining the benefits of data locality in this environment. We evaluate our approach by running concurrent workloads consisting of the Word-count and Terasort benchmarks and a satellite scientific data processing workload, and by developing a simulator. Our evaluation shows that HybS improves the average response time of these workloads by approximately 2.1x over the Hadoop FairS, with a standard deviation of 1.4x.
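The abstract does not include code, but its core assignment step can be sketched. Below is a minimal, hypothetical Python sketch of a HybS-style scheduling tick, not the authors' implementation: each job receives a dynamic priority that grows with its waiting time and shrinks with its estimated remaining work (the estimates standing in for runtimes mined from historical Hadoop logs), free slots are filled greedily in priority-per-unit-runtime order in the spirit of the fractional knapsack heuristic, and candidates are reordered within a small window to prefer tasks whose input split is local to the requesting node. The `Job`/`Task` structures, the priority formula, and the `locality_window` parameter are all illustrative assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    job_id: str
    est_runtime: float        # estimated from historical Hadoop logs (assumed available)
    input_hosts: frozenset    # nodes holding this task's input split

@dataclass
class Job:
    job_id: str
    wait_time: float          # seconds since submission
    est_remaining: float      # estimated runtime of remaining work
    pending: list = field(default_factory=list)

def dynamic_priority(job, alpha=1.0, beta=1.0):
    """Illustrative priority (an assumption, not the paper's formula):
    grows with waiting time, shrinks with estimated remaining work, so
    short jobs and long-waiting jobs both rise instead of being stuck
    behind long-running jobs."""
    return alpha * job.wait_time / (beta * job.est_remaining + 1e-9)

def assign_slots(jobs, node, free_slots, locality_window=5):
    """Fill `free_slots` on `node`: rank tasks by priority per unit of
    estimated runtime (fractional-knapsack-style greedy), then reorder
    within a small window to prefer node-local input data."""
    ranked = sorted(
        ((dynamic_priority(job) / task.est_runtime, task, job)
         for job in jobs for task in job.pending),
        key=lambda c: c[0], reverse=True)

    assigned = []
    while ranked and len(assigned) < free_slots:
        # Locality reordering: among the highest-priority candidates,
        # prefer one whose input split already lives on this node.
        window = ranked[:locality_window]
        pick = next((c for c in window if node in c[1].input_hosts),
                    window[0])
        ranked.remove(pick)
        pick[2].pending.remove(pick[1])
        assigned.append(pick[1])
    return assigned

if __name__ == "__main__":
    jobs = [
        Job("wordcount", wait_time=120.0, est_remaining=300.0,
            pending=[Task("wordcount", 30.0, frozenset({"node1", "node3"}))]),
        Job("terasort", wait_time=10.0, est_remaining=5000.0,
            pending=[Task("terasort", 90.0, frozenset({"node2"}))]),
    ]
    for task in assign_slots(jobs, node="node1", free_slots=2):
        print(task.job_id, task.est_runtime)
```

A real deployment would plug such logic into Hadoop's scheduler plugin interface and recompute priorities on each heartbeat; this sketch only shows the greedy ordering and the locality-aware reordering that the abstract describes.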