Automating Platform Selection for MapReduce Processing in the Cloud
Zhuoyao Zhang, L. Cherkasova, B. T. Loo
2015 International Conference on Cloud and Autonomic Computing, published 2015-09-01
DOI: 10.1109/ICCAC.2015.15 (https://doi.org/10.1109/ICCAC.2015.15)
Citation count: 8
Abstract
Cloud computing enables a user to quickly provision a Hadoop cluster of any desired size and then pay for the time these resources are used. With the same budget, a user can rent a larger amount of resources and process a scale-out application in a shorter time, or rent a smaller cluster and pay for a longer processing time. Moreover, the Cloud offers a variety of VM instance types (e.g., small, medium, or large EC2 instances), and the capacity differences among the offered VMs are reflected in their pricing. Therefore, for the same price a user can again get a variety of "similar capacity" Hadoop clusters based on different VM instance types. We observe that the performance of MapReduce applications may vary significantly across these platforms. This makes selecting the best cost/performance platform for a given workload a non-trivial problem, especially when the workload contains multiple jobs with different platform preferences. In this work, we design a framework for solving the following problem: given a completion time target for a set of MapReduce jobs, determine a homogeneous or heterogeneous Hadoop cluster configuration (i.e., the number and types of VMs, and the job schedule) for processing these jobs within the given deadline while minimizing the rented infrastructure cost. We generalize the proposed framework to take into account possible node failures and degraded performance goals. Our evaluation study on the Amazon EC2 platform reveals that, for different workload mixes, an optimized platform choice may yield 45-68% cost savings compared with different (but seemingly equivalent) choices that achieve the same performance objectives. Moreover, depending on the workload, the heterogeneous solution may outperform the homogeneous cluster solution by 26-42%. We analyze and discuss possible causes for the observed performance differences of MapReduce processing on the Amazon EC2 platforms.
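The selection problem stated in the abstract (meet a deadline at minimum rental cost) can be illustrated with a small brute-force sketch. This is not the authors' framework, which the abstract does not detail. It is a toy search over homogeneous clusters only, and it makes several assumptions: perfect scale-out, billing in whole hours, and a hypothetical catalog whose names, prices, and throughput figures (`InstanceType`, `cents_per_hour`, `work_per_hour`) are invented for illustration and are not real EC2 rates or measurements.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class InstanceType:
    name: str
    cents_per_hour: int  # hypothetical price, not a real EC2 rate
    work_per_hour: int   # hypothetical per-VM throughput for this workload


@dataclass(frozen=True)
class Plan:
    instance: str
    count: int
    billed_hours: int
    cost_cents: int


def billed_hours(total_work: int, itype: InstanceType, count: int) -> int:
    """Toy model: perfect scale-out, rounded up to whole billed hours."""
    capacity = itype.work_per_hour * count
    return -(-total_work // capacity)  # ceiling division on integers


def cheapest_homogeneous_plan(
    total_work: int,
    deadline_hours: int,
    types: Sequence[InstanceType],
    max_nodes: int,
) -> Optional[Plan]:
    """Try every (instance type, node count) pair and keep the cheapest
    one that finishes within the deadline; None if nothing is feasible."""
    best: Optional[Plan] = None
    for itype in types:
        for count in range(1, max_nodes + 1):
            hours = billed_hours(total_work, itype, count)
            if hours > deadline_hours:
                continue  # this cluster misses the deadline
            cost = hours * count * itype.cents_per_hour
            if best is None or cost < best.cost_cents:
                best = Plan(itype.name, count, hours, cost)
    return best


if __name__ == "__main__":
    # Made-up catalog: "large" costs 4x "small" but is assumed to deliver
    # 5x the throughput on this particular workload.
    catalog = [InstanceType("small", 6, 1), InstanceType("large", 24, 5)]
    print(cheapest_homogeneous_plan(100, 5, catalog, 50))
```

In this made-up catalog the two instance types look roughly equivalent per unit of price, yet the assumed throughput difference makes one of them cheaper for the same deadline, which mirrors the abstract's observation in miniature. A heterogeneous configuration and a job schedule, both part of the problem the paper addresses, are outside the scope of this sketch.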