{"title":"Improving Data Locality of Tasks by Executor Allocation in Spark Computing Environment","authors":"Zhongming Fu;Mengsi He;Yang Yi;Zhuo Tang","doi":"10.1109/TCC.2024.3406041","DOIUrl":null,"url":null,"abstract":"The concept of data locality is crucial for distributed systems (e.g., Spark and Hadoop) to process Big Data. Most of the existing research optimized the data locality from the aspect of task scheduling. However, as the execution container of Spark's tasks, the executor launched on different nodes can directly affect the data locality achieved by the tasks. This article tries to improve the data locality of tasks by executor allocation in Spark framework. First, because of different communication modes at stages, we separately model the communication cost of tasks for transferring input data to the executors. Then formalize an optimal executor allocation problem to minimize the total communication cost of transferring all input data. This problem is proven to be NP-hard. Finally, we present a greed dropping heuristic algorithm to provide solution to the executor allocation problem. Our proposals are implemented in Spark-3.4.0 and its performance is evaluated through representative micro-benchmarks (i.e., \n<italic>WordCount</i>\n, \n<italic>Join</i>\n, \n<italic>Sort</i>\n) and macro-benchmarks (i.e., \n<italic>PageRank</i>\n and \n<italic>LDA</i>\n). Extensive experiments show that the proposed executor allocation strategy can decrease the network traffic and data access time by improving the data locality during the task scheduling. Its performance benefits are particularly significant for iterative applications.","PeriodicalId":13202,"journal":{"name":"IEEE Transactions on Cloud Computing","volume":"12 3","pages":"876-888"},"PeriodicalIF":5.3000,"publicationDate":"2024-03-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Cloud Computing","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10540265/","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Citations: 0
Abstract
The concept of data locality is crucial for distributed systems (e.g., Spark and Hadoop) to process Big Data. Most existing research optimizes data locality from the perspective of task scheduling. However, as the execution containers for Spark's tasks, the executors launched on different nodes directly affect the data locality achieved by the tasks. This article improves the data locality of tasks through executor allocation in the Spark framework. First, because stages use different communication modes, we separately model the communication cost of tasks for transferring input data to the executors. We then formalize an optimal executor allocation problem that minimizes the total communication cost of transferring all input data, and prove this problem to be NP-hard. Finally, we present a greedy dropping heuristic algorithm that provides a solution to the executor allocation problem. Our proposals are implemented in Spark-3.4.0, and their performance is evaluated through representative micro-benchmarks (i.e., WordCount, Join, Sort) and macro-benchmarks (i.e., PageRank and LDA). Extensive experiments show that the proposed executor allocation strategy decreases network traffic and data access time by improving data locality during task scheduling. Its performance benefits are particularly significant for iterative applications.
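To make the idea of a "dropping" style heuristic concrete, the following is a minimal illustrative sketch only, not the authors' algorithm or cost model. It assumes a simplified setting (hypothetical names and parameters: input blocks described as a block-to-host map, unit transfer cost for any non-local block, one executor per selected node) and greedily drops the candidate node whose removal increases the total communication cost the least.

```python
# Illustrative sketch of a greedy "dropping" heuristic for executor placement.
# NOTE: the cost model and procedure here are assumptions for illustration,
# not the paper's actual formulation.

def transfer_cost(block_host: str, executor_nodes: set[str]) -> int:
    """Cost of feeding one input block to an executor: 0 if some executor
    is on the block's host (node-local), otherwise 1 (remote transfer)."""
    return 0 if block_host in executor_nodes else 1

def total_cost(blocks: dict[str, str], executor_nodes: set[str]) -> int:
    """Total communication cost of moving every input block to an executor."""
    return sum(transfer_cost(host, executor_nodes) for host in blocks.values())

def greedy_dropping_allocation(blocks: dict[str, str],
                               candidate_nodes: set[str],
                               num_executors: int) -> set[str]:
    """Start with all candidate nodes and repeatedly drop the node whose
    removal increases the total communication cost the least, until only
    num_executors nodes remain."""
    selected = set(candidate_nodes)
    while len(selected) > num_executors:
        victim = min(selected, key=lambda n: total_cost(blocks, selected - {n}))
        selected.remove(victim)
    return selected

if __name__ == "__main__":
    # Toy example: four input blocks spread over three nodes.
    blocks = {"b0": "node1", "b1": "node1", "b2": "node2", "b3": "node3"}
    nodes = {"node1", "node2", "node3"}
    print(greedy_dropping_allocation(blocks, nodes, num_executors=2))
    # node1 holds the most blocks, so it is kept; one of node2/node3 is dropped.
```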
About the journal:
The IEEE Transactions on Cloud Computing (TCC) is dedicated to the multidisciplinary field of cloud computing. It is committed to the publication of articles that present innovative research ideas, application results, and case studies in cloud computing, focusing on key technical issues related to theory, algorithms, systems, applications, and performance.