Scale-up vs scale-out for Hadoop: time to rethink?
Raja Appuswamy, C. Gkantsidis, D. Narayanan, O. Hodson, A. Rowstron
Proceedings of the 4th annual Symposium on Cloud Computing (SoCC 2013), October 2013
DOI: 10.1145/2523616.2523629 (https://doi.org/10.1145/2523616.2523629)
Citations: 178
Abstract
In the last decade we have seen a huge deployment of cheap clusters to run data analytics workloads. The conventional wisdom in industry and academia is that scaling out using a cluster of commodity machines is better for these workloads than scaling up by adding more resources to a single server. Popular analytics infrastructures such as Hadoop are aimed at such a cluster scale-out environment. Is this the right approach? Our measurements, as well as other recent work, show that the majority of real-world analytic jobs process less than 100 GB of input, yet popular infrastructures such as Hadoop/MapReduce were originally designed for petascale processing. We claim that a single "scale-up" server can process each of these jobs and do as well as or better than a cluster in terms of performance, cost, power, and server density. We present an evaluation across 11 representative Hadoop jobs that shows scale-up to be competitive with scale-out in all cases, and significantly better in some. To achieve that performance, we describe several modifications to the Hadoop runtime that target the scale-up configuration. These changes are transparent, require no changes to application code, and do not compromise scale-out performance; at the same time, our evaluation shows that they significantly improve Hadoop's scale-up performance.
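The abstract only alludes to the runtime modifications, so as a point of reference, the sketch below shows the general flavor of single-server Hadoop tuning using standard Hadoop 2.x configuration keys. It is an illustrative assumption, not the authors' actual changes: the RAM-disk path, memory sizes, and 32-core count are made up for the example.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

/**
 * Illustrative (not taken from the paper) Hadoop 2.x job configuration for a
 * single large-memory server. Property keys are standard; values are assumptions.
 */
public class ScaleUpJobConfig {
    public static Job configure() throws IOException {
        Configuration conf = new Configuration();

        // Keep intermediate (map-output/shuffle) data on a RAM-backed filesystem
        // so it never touches disk on a machine with ample memory.
        // The mount point is an assumption for illustration.
        conf.set("mapreduce.cluster.local.dir", "/mnt/ramdisk/mapred-local");

        // A larger in-task sort buffer reduces map-side spills when the input is
        // small relative to server memory (value is illustrative).
        conf.setInt("mapreduce.task.io.sort.mb", 1024);

        // Give each map/reduce container more memory than commodity-cluster
        // defaults, since a scale-up server has far more RAM per task slot.
        conf.setInt("mapreduce.map.memory.mb", 4096);
        conf.setInt("mapreduce.reduce.memory.mb", 8192);

        Job job = Job.getInstance(conf, "scale-up-example");
        // One reducer per core of the (assumed) 32-core server, so the reduce
        // phase completes in a single wave on the one machine.
        job.setNumReduceTasks(32);
        return job;
    }
}
```

Note that only the Configuration object changes here; the map and reduce functions themselves are untouched, which mirrors the abstract's point that such tuning can be transparent to application code.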