Automating Platform Selection for MapReduce Processing in the Cloud

2015 International Conference on Cloud and Autonomic Computing. Pub Date: 2015-09-01. DOI: 10.1109/ICCAC.2015.15

Zhuoyao Zhang, L. Cherkasova, B. T. Loo


Citations: 8

Abstract

Cloud computing enables a user to quickly provision a Hadoop cluster of any desired size and then pay for the time these resources were used. With the same budget, a user can rent a larger amount of resources and process a scale-out application in a shorter time, or rent a smaller cluster and pay for a longer processing time. Moreover, the Cloud offers a variety of VM instance types (e.g., small, medium, or large EC2 instances). The capacity differences of the offered VMs are reflected in their pricing. Therefore, for the same price a user can get a variety of "similar capacity" Hadoop clusters based on different VM instance types. We observe that the performance of MapReduce applications may vary significantly across platforms. This makes selecting the best cost/performance platform for a given workload a non-trivial problem, especially when the workload contains multiple jobs with different platform preferences. In this work, we design a framework for solving the following problem: given a completion time target for a set of MapReduce jobs, determine a homogeneous or heterogeneous Hadoop cluster configuration (i.e., the number and types of VMs, and the job schedule) for processing these jobs within the given deadline while minimizing the cost of the rented infrastructure. We generalize the proposed framework to take into account possible node failures and degraded performance goals. Our evaluation study on the Amazon EC2 platform reveals that, for different workload mixes, an optimized platform choice may yield 45-68% cost savings compared with other (seemingly equivalent) choices that achieve the same performance objectives. Moreover, depending on the workload, the heterogeneous solution may outperform the homogeneous cluster solution by 26-42%. We analyze and discuss possible causes of the observed performance differences of MapReduce processing on the Amazon EC2 platforms.
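To make the problem statement concrete, the following is a minimal sketch of the homogeneous case only: given a deadline, search over instance types and cluster sizes for the cheapest configuration that finishes in time. It is not the framework described in the paper. The names `InstanceType`, `estimate_hours`, and `cheapest_homogeneous` are hypothetical, the completion-time model (tasks run in equal-length waves over all slots) is a deliberate simplification, and the prices, slot counts, and task durations are made-up illustrative values, not EC2 figures or measured results. Billing is assumed to be proportional to running time, with no hourly rounding.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceType:
    name: str
    price_per_hour: float  # hypothetical price, not a real EC2 quote
    slots: int             # hypothetical number of task slots per VM
    task_hours: float      # hypothetical time one slot needs per task


def estimate_hours(num_tasks: int, vm: InstanceType, num_vms: int) -> float:
    """Idealized makespan: tasks run in equal-length waves over all slots."""
    waves = math.ceil(num_tasks / (vm.slots * num_vms))
    return waves * vm.task_hours


def cheapest_homogeneous(num_tasks, deadline_hours, vm_types, max_vms=100):
    """Return (cost, type name, number of VMs, hours) for the cheapest
    single-type cluster that meets the deadline, or None if none does."""
    best = None
    for vm in vm_types:
        for n in range(1, max_vms + 1):
            hours = estimate_hours(num_tasks, vm, n)
            if hours > deadline_hours:
                continue  # this cluster size misses the deadline
            cost = n * vm.price_per_hour * hours
            if best is None or cost < best[0]:
                best = (cost, vm.name, n, hours)
    return best


# Illustrative inputs only: the "large" type is assumed to run each task
# faster, modelling a job that prefers the larger platform.
VM_TYPES = [
    InstanceType("small", price_per_hour=1.0, slots=1, task_hours=0.5),
    InstanceType("large", price_per_hour=4.0, slots=4, task_hours=0.25),
]

if __name__ == "__main__":
    print(cheapest_homogeneous(num_tasks=40, deadline_hours=2.0,
                               vm_types=VM_TYPES))
```

Under these assumed inputs the two instance types have the same price per slot, yet the cheapest deadline-meeting cluster differs in cost between them because the per-task time differs, which is the kind of platform preference the abstract refers to. Heterogeneous clusters, job scheduling, and node failures are outside the scope of this sketch.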

