CASH: context aware scheduler for Hadoop

International Conference on Advances in Computing, Communications and Informatics. Pub Date: 2012-08-03. DOI: 10.1145/2345396.2345406

K. A. Kumar, Vamshi Krishna Konishetty, K. Voruganti, G. V. P. Rao


Citations: 45

Abstract

The Hadoop MapReduce infrastructure is designed to solve problems that can be broken down into tasks that run in parallel. A key reason for MapReduce's popularity is that it runs on commodity hardware and comes with a job scheduler and task management framework. The framework thus lets application programmers focus on their application rather than on the management infrastructure. The job scheduler is a key component of the MapReduce framework because it controls when and where a job's tasks are executed. However, current MapReduce schedulers assume that the Hadoop cluster is homogeneous. In this paper we show that making the scheduler aware of cluster heterogeneity, and having it leverage that heterogeneity, can improve the overall throughput of the system.

The design of our scheduler is based on two key insights: 1) A large percentage of MapReduce jobs are periodic. That is, these jobs execute at the same time and have roughly the same characteristics with respect to their CPU, network, and disk resource requirements. 2) The nodes in a Hadoop cluster become heterogeneous over time as failed and old nodes are replaced by newer ones. Thus, there is a need for a 'Context Aware Scheduler for Hadoop (CASH)' that knows the context, i.e., the job characteristics (CPU-bound or I/O-bound) and the resource characteristics, such as the computational or I/O strength of the nodes in the cluster.

We have implemented the CASH algorithm both in a simulator and on a real Hadoop MapReduce cluster. We quantitatively compare CASH with the existing Hadoop FIFO scheduler, and our results show a significant improvement in the overall execution time of a set of MapReduce jobs. Additionally, we optimized the CASH algorithm for jobs that share the same working-set data and showed the benefits.
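The abstract does not give the CASH algorithm itself, so the following is only a minimal illustrative sketch of the general idea it describes: label jobs as CPU-bound or I/O-bound, label nodes by their relative strength, and prefer a matching job when a node asks for work, instead of always taking the head of the queue as FIFO does. The names `Node`, `Job`, `classify_node`, `fifo_pick`, and `context_aware_pick`, the strength values, and the fallback rule are all assumptions made for illustration and are not taken from the paper or from Hadoop's scheduler API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Node:
    """A cluster node with hypothetical relative strength scores."""
    name: str
    cpu_strength: float
    io_strength: float


@dataclass(frozen=True)
class Job:
    """A queued job labelled by its dominant resource: "cpu" or "io"."""
    name: str
    kind: str


def classify_node(node: Node) -> str:
    # Assumption: a node counts as compute-strong when its CPU score is
    # at least its I/O score; otherwise it counts as I/O-strong.
    return "cpu" if node.cpu_strength >= node.io_strength else "io"


def fifo_pick(queue: List[Job], node: Node) -> Optional[Job]:
    # FIFO baseline: ignore the node and return the oldest queued job.
    return queue[0] if queue else None


def context_aware_pick(queue: List[Job], node: Node) -> Optional[Job]:
    # Prefer the oldest job whose kind matches the node's strength.
    # If nothing matches, fall back to the head of the queue so the
    # node does not sit idle (an illustrative design choice).
    if not queue:
        return None
    wanted = classify_node(node)
    for job in queue:
        if job.kind == wanted:
            return job
    return queue[0]


if __name__ == "__main__":
    queue = [Job("log-scan", "io"), Job("matrix-mult", "cpu"), Job("sort", "io")]
    newer_node = Node("newer-node", cpu_strength=2.0, io_strength=1.0)
    print(fifo_pick(queue, newer_node).name)
    print(context_aware_pick(queue, newer_node).name)
```

In this toy queue, the FIFO baseline hands the I/O-bound job at the head of the queue to the compute-strong node, while the context-aware picker skips ahead to the first CPU-bound job. A real scheduler would also have to account for data locality, fairness, and how job characteristics are learned from earlier periodic runs, none of which is modelled here.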


