{"title":"Sparker: Optimizing Spark for Heterogeneous Clusters","authors":"Nishank Garg, D. Janakiram","doi":"10.1109/CloudCom2018.2018.00017","DOIUrl":null,"url":null,"abstract":"Spark is an in-memory big data analytics framework which has replaced Hadoop as the de facto standard for processing big data in cloud platforms. These frameworks run on cloud platforms where heterogeneity is a common scenario. Heterogeneity gets introduced due to the failure, addition or upgradation of nodes in the cloud platforms. It can arise from various factors such as variation in the number of CPU cores, amount of memory, disk read/write latencies across the nodes, etc. These factors have a significant impact on the performance of Spark jobs. Spark supports execution of a job on equal-sized executors which can result in under allocation of resources in a heterogeneous cluster. Insufficient resources can severely degrade the performance of CPU and memory intensive applications like machine learning, graph processing, etc. Existing techniques use equal-sized executors which can degrade the performance of jobs in heterogeneous environments. In this paper, we propose Sparker, an efficient resource-aware optimization strategy for Spark in heterogeneous clusters. It overcomes the limitation of heterogeneity in terms of CPU and memory resources by modifying the size of the executor. The executors are re-sized based on the available resources of the node. We have modified Spark source code to incorporate executor re-sizing strategy. Experimental evaluation on SparkBench benchmark shows that our approach achieves a reduction of up to 46% in execution time.","PeriodicalId":365939,"journal":{"name":"2018 IEEE International Conference on Cloud Computing Technology and Science (CloudCom)","volume":"114 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"4","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2018 IEEE International Conference on Cloud Computing Technology and Science (CloudCom)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CloudCom2018.2018.00017","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 4
Abstract
Spark is an in-memory big data analytics framework that has replaced Hadoop as the de facto standard for processing big data on cloud platforms. These frameworks run on cloud platforms where heterogeneity is a common scenario: it is introduced by the failure, addition, or upgrade of nodes, and can arise from factors such as variation in the number of CPU cores, the amount of memory, and disk read/write latencies across nodes. These factors have a significant impact on the performance of Spark jobs. Spark executes a job on equal-sized executors, which can result in under-allocation of resources in a heterogeneous cluster. Insufficient resources can severely degrade the performance of CPU- and memory-intensive applications such as machine learning and graph processing. Existing techniques use equal-sized executors, which can degrade the performance of jobs in heterogeneous environments. In this paper, we propose Sparker, an efficient resource-aware optimization strategy for Spark in heterogeneous clusters. It overcomes the limitation of heterogeneity in CPU and memory resources by resizing each executor according to the resources available on its node. We have modified the Spark source code to incorporate this executor resizing strategy. Experimental evaluation on the SparkBench benchmark shows that our approach reduces execution time by up to 46%.
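The core idea of the abstract — sizing each executor to its node's actual capacity rather than applying a single cluster-wide executor size — can be illustrated with a minimal sketch. Note that `NodeResources`, `ExecutorSize`, `sizeExecutor`, and the reserve defaults below are all hypothetical names introduced for illustration; the paper's actual change is a modification to Spark's scheduler internals, not user-level code.

```scala
// Hypothetical sketch of resource-aware executor sizing, NOT the paper's
// actual implementation. It contrasts with Spark's default behavior, where
// spark.executor.cores/memory fix one executor size for every node.

/** Available resources reported for one cluster node (illustrative type). */
case class NodeResources(cores: Int, memoryMb: Long)

/** Resources assigned to a single executor (illustrative type). */
case class ExecutorSize(cores: Int, memoryMb: Long)

object ResourceAwareSizing {
  /**
   * Size one executor per node so it uses that node's capacity, minus a
   * reserve for the OS and node daemons, instead of a one-size-fits-all
   * value. The reserve defaults are assumptions, not values from the paper.
   */
  def sizeExecutor(node: NodeResources,
                   reservedCores: Int = 1,
                   reservedMemMb: Long = 1024L): ExecutorSize = {
    val cores = math.max(1, node.cores - reservedCores)
    val memMb = math.max(512L, node.memoryMb - reservedMemMb)
    ExecutorSize(cores, memMb)
  }

  def main(args: Array[String]): Unit = {
    // A heterogeneous cluster: one small node, one large node.
    val nodes = Seq(NodeResources(4, 8 * 1024L), NodeResources(16, 64 * 1024L))
    nodes.map(sizeExecutor(_)).foreach(println)
    // Prints ExecutorSize(3,7168) and ExecutorSize(15,64512): each node's
    // executor matches its capacity. A fixed equal size would either
    // overcommit the small node or leave the large node underused.
  }
}
```

Under this sketch's assumptions, the large node's executor gets roughly four times the resources of the small node's, which is the under-allocation the abstract says equal-sized executors leave on the table in heterogeneous clusters.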