Overcoming Hadoop Scaling Limitations through Distributed Task Execution

2015 IEEE International Conference on Cluster Computing | Pub Date: 2015-09-08 | DOI: 10.1109/CLUSTER.2015.42

Ke Wang, Ning Liu, Iman Sadooghi, Xi Yang, Xiaobing Zhou, Tonglin Li, M. Lang, Xian-He Sun, I. Raicu


Citations: 68

Abstract

Data-driven programming models such as MapReduce have gained popularity in large-scale data processing. Although considerable effort on the Hadoop implementation and on framework decoupling (e.g., YARN, Mesos) has allowed Hadoop to scale to tens of thousands of commodity cluster processors, the centralized designs of the resource manager, the task scheduler, and the metadata management of the HDFS file system adversely affect Hadoop's scalability to tomorrow's extreme-scale data centers. This paper aims to address the YARN scaling issues through a distributed task execution framework, MATRIX, which was originally designed to schedule data-intensive scientific many-task computing applications on supercomputers. We propose to leverage the distributed design principles of MATRIX to schedule arbitrary data processing applications in the cloud. We compare MATRIX with YARN in processing typical Hadoop workloads, such as WordCount, TeraSort, Grep, and RandomWriter, and the Ligand application in bioinformatics, on the Amazon cloud. Experimental results show that MATRIX outperforms YARN by 1.27X for the typical workloads and by 2.04X for the real application. We also run and simulate MATRIX with fine-grained sub-second workloads. With simulation results showing an efficiency of 86.8% at 64K cores for the 150 ms workload, we show that MATRIX has the potential to enable Hadoop to scale to extreme-scale data centers for fine-grained workloads.
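
The headline figures in the abstract (1.27X, 2.04X, 86.8%) can be read as simple ratios. The sketch below is only an illustration of that arithmetic, not the paper's measurement code. It assumes that "outperforms by N X" means baseline runtime divided by candidate runtime, and that efficiency means ideal makespan divided by measured makespan; the abstract does not define either term, and every function name and input number here is hypothetical.

```python
def speedup(baseline_seconds: float, candidate_seconds: float) -> float:
    """Assumed definition: how many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds


def efficiency(task_seconds: float, tasks_per_core: int, measured_makespan_seconds: float) -> float:
    """Assumed definition: ideal makespan (no scheduling overhead) over measured makespan."""
    ideal_makespan = task_seconds * tasks_per_core
    return ideal_makespan / measured_makespan_seconds


def makespan_at_efficiency(task_seconds: float, tasks_per_core: int, target_efficiency: float) -> float:
    """Invert the efficiency ratio: the makespan implied by a given efficiency."""
    return task_seconds * tasks_per_core / target_efficiency


if __name__ == "__main__":
    # Hypothetical runtimes chosen only so the ratio equals the abstract's 1.27X.
    print(f"speedup: {speedup(127.0, 100.0):.2f}X")

    # Hypothetical load: 1000 tasks of 150 ms on each core, so the ideal makespan is 150 s.
    implied = makespan_at_efficiency(0.150, 1000, 0.868)
    print(f"makespan implied by 86.8% efficiency: {implied:.1f} s")
    print(f"round-trip efficiency: {efficiency(0.150, 1000, implied):.3f}")
```

Under these assumptions, an efficiency of 86.8% means that about 13% of the measured makespan goes to scheduling and communication overhead rather than task execution. Because the ideal makespan scales with task length while per-task overhead does not, sub-second tasks are a demanding case for a scheduler.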

