{"title":"基于成本意识和延迟效益评估的Apache Spark任务调度优化策略","authors":"Qingsong Xu, Congyang Wang, Junyang Yu, Haifeng Fei, Xiaojin Ren","doi":"10.1002/cpe.70244","DOIUrl":null,"url":null,"abstract":"<div>\n \n <p>In the Spark distributed framework, data communication problems (network transfer overhead, network IO bottlenecks) caused by data transfer across nodes/racks are a common cause of performance degradation due to the inconsistency between the task execution location and the data location. Additionally, in heterogeneous environments, Spark's task scheduling strategy cannot fully utilize the advantages of high-performance nodes. To address the above issues, firstly, this paper proposes a cost-aware task selection strategy. The strategy models the cost of tasks by considering the impact of data locality and heterogeneous factors on the efficiency of job execution. For scenarios where data locality needs to be reduced for scheduling tasks, the task scheduling problem is transformed into a minimum weighted bipartite graph matching problem, and a greedy matching algorithm is used to solve for the minimum processing cost option. For scenarios that maintain the current data localization level for scheduling tasks, select the task execution with the largest change in task processing cost due to data localization changes. Secondly, the problem is that Spark's delay scheduling algorithm causes resources in the cluster to be in an unnecessary waiting state and reduces cluster resource utilization. In this paper, we propose an adaptive adjustment strategy for delay waiting time based on benefit assessment. This policy improves the resource utilization of the cluster by evaluating the benefit of delay waiting of the scheduler and dynamically adjusting the delay time based on the result of the evaluation. Finally, we implement the proposed strategy in Spark 3.0.0 and evaluate its performance using some representative benchmarks. The experimental results show that, compared with other task scheduling algorithms, the strategy proposed in this paper can effectively improve the execution efficiency of jobs, reduce the execution time of jobs by 15.8%–31.9%, and at the same time reduce the network traffic and improve the CPU utilization.</p>\n </div>","PeriodicalId":55214,"journal":{"name":"Concurrency and Computation-Practice & Experience","volume":"37 23-24","pages":""},"PeriodicalIF":1.5000,"publicationDate":"2025-09-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"A Cost-Aware and Latency-Benefit Evaluation-Based Task Scheduling Optimization Strategy in Apache Spark\",\"authors\":\"Qingsong Xu, Congyang Wang, Junyang Yu, Haifeng Fei, Xiaojin Ren\",\"doi\":\"10.1002/cpe.70244\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div>\\n \\n <p>In the Spark distributed framework, data communication problems (network transfer overhead, network IO bottlenecks) caused by data transfer across nodes/racks are a common cause of performance degradation due to the inconsistency between the task execution location and the data location. Additionally, in heterogeneous environments, Spark's task scheduling strategy cannot fully utilize the advantages of high-performance nodes. To address the above issues, firstly, this paper proposes a cost-aware task selection strategy. The strategy models the cost of tasks by considering the impact of data locality and heterogeneous factors on the efficiency of job execution. 
For scenarios where data locality needs to be reduced for scheduling tasks, the task scheduling problem is transformed into a minimum weighted bipartite graph matching problem, and a greedy matching algorithm is used to solve for the minimum processing cost option. For scenarios that maintain the current data localization level for scheduling tasks, select the task execution with the largest change in task processing cost due to data localization changes. Secondly, the problem is that Spark's delay scheduling algorithm causes resources in the cluster to be in an unnecessary waiting state and reduces cluster resource utilization. In this paper, we propose an adaptive adjustment strategy for delay waiting time based on benefit assessment. This policy improves the resource utilization of the cluster by evaluating the benefit of delay waiting of the scheduler and dynamically adjusting the delay time based on the result of the evaluation. Finally, we implement the proposed strategy in Spark 3.0.0 and evaluate its performance using some representative benchmarks. The experimental results show that, compared with other task scheduling algorithms, the strategy proposed in this paper can effectively improve the execution efficiency of jobs, reduce the execution time of jobs by 15.8%–31.9%, and at the same time reduce the network traffic and improve the CPU utilization.</p>\\n </div>\",\"PeriodicalId\":55214,\"journal\":{\"name\":\"Concurrency and Computation-Practice & Experience\",\"volume\":\"37 23-24\",\"pages\":\"\"},\"PeriodicalIF\":1.5000,\"publicationDate\":\"2025-09-03\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Concurrency and Computation-Practice & Experience\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://onlinelibrary.wiley.com/doi/10.1002/cpe.70244\",\"RegionNum\":4,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"COMPUTER SCIENCE, SOFTWARE ENGINEERING\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Concurrency and Computation-Practice & Experience","FirstCategoryId":"94","ListUrlMain":"https://onlinelibrary.wiley.com/doi/10.1002/cpe.70244","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, SOFTWARE ENGINEERING","Score":null,"Total":0}
A Cost-Aware and Latency-Benefit Evaluation-Based Task Scheduling Optimization Strategy in Apache Spark
In the Spark distributed framework, a mismatch between where a task executes and where its data resides forces data transfers across nodes or racks, and the resulting network transfer overhead and network I/O bottlenecks are a common cause of performance degradation. Moreover, in heterogeneous environments, Spark's task scheduling strategy cannot fully exploit high-performance nodes. To address these issues, this paper first proposes a cost-aware task selection strategy that models task processing cost by accounting for the impact of data locality and node heterogeneity on job execution efficiency. When the data locality level must be lowered to schedule tasks, the scheduling problem is cast as a minimum-weight bipartite graph matching problem and solved with a greedy matching algorithm to find the lowest-cost assignments. When the current data locality level can be maintained, the strategy selects the task whose processing cost would change the most if its locality level were degraded. Second, because Spark's delay scheduling algorithm can leave cluster resources in unnecessary waiting states and thereby reduce cluster resource utilization, we propose an adaptive adjustment strategy for the delay waiting time based on benefit evaluation: the scheduler estimates the benefit of continuing to wait and dynamically adjusts the delay time according to that estimate. Finally, we implement the proposed strategy in Spark 3.0.0 and evaluate it on representative benchmarks. The experimental results show that, compared with other task scheduling algorithms, the proposed strategy effectively improves job execution efficiency, reducing job execution time by 15.8%–31.9%, while also reducing network traffic and improving CPU utilization.
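To make the bipartite-matching idea concrete, the following is a minimal, self-contained Scala sketch of greedy cost-aware task-to-executor assignment, not the paper's implementation. The Task and Slot case classes, the locality-penalty constants, and the exact cost formula are illustrative assumptions; the abstract only states that the cost model combines data locality and node heterogeneity, so the numbers below are placeholders.

```scala
import scala.collection.mutable

object GreedyCostAwareMatching {

  // Hypothetical inputs: a pending task with its input size and, per host,
  // the locality level its data would have there (0 = node-local, larger = worse);
  // an executor slot with a relative speed factor for heterogeneous nodes.
  final case class Task(id: Int, inputBytes: Long, localityByHost: Map[String, Int])
  final case class Slot(id: Int, host: String, relativeSpeed: Double) // 1.0 = baseline node

  // Assumed processing-cost model: a locality penalty scales the data volume,
  // and faster nodes (relativeSpeed > 1.0) reduce the estimated cost.
  def cost(t: Task, s: Slot): Double = {
    val locality = t.localityByHost.getOrElse(s.host, 4) // treat unknown hosts as worst case
    val transferPenalty = 1.0 + 0.5 * locality           // assumed per-level penalty
    t.inputBytes.toDouble * transferPenalty / s.relativeSpeed
  }

  // Greedy approximation of minimum-weight bipartite matching:
  // repeatedly take the cheapest still-unmatched (task, slot) pair.
  def assign(tasks: Seq[Task], slots: Seq[Slot]): Seq[(Task, Slot)] = {
    val candidates = (for (t <- tasks; s <- slots) yield (t, s, cost(t, s))).sortBy(_._3)
    val takenTasks = mutable.Set.empty[Int]
    val takenSlots = mutable.Set.empty[Int]
    val matching  = mutable.ArrayBuffer.empty[(Task, Slot)]
    for ((t, s, _) <- candidates if !takenTasks(t.id) && !takenSlots(s.id)) {
      takenTasks += t.id
      takenSlots += s.id
      matching += ((t, s))
    }
    matching.toSeq
  }

  def main(args: Array[String]): Unit = {
    val tasks = Seq(
      Task(1, 256L << 20, Map("node-a" -> 0)),                // data resides on node-a
      Task(2, 128L << 20, Map("node-b" -> 0, "node-a" -> 2))  // local to node-b, rack-level on node-a
    )
    val slots = Seq(Slot(1, "node-a", 1.5), Slot(2, "node-b", 1.0))
    assign(tasks, slots).foreach { case (t, s) =>
      println(f"task ${t.id} -> slot ${s.id} on ${s.host} (est. cost ${cost(t, s)}%.0f)")
    }
  }
}
```

A greedy pass over cost-sorted pairs is an approximation rather than an exact minimum-weight matching; it trades optimality for low scheduling overhead, which is consistent with the abstract's choice of a greedy matching algorithm for this step.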
Journal introduction:
Concurrency and Computation: Practice and Experience (CCPE) publishes high-quality original research papers and authoritative research review papers in the overlapping fields of:
Parallel and distributed computing;
High-performance computing;
Computational and data science;
Artificial intelligence and machine learning;
Big data applications, algorithms, and systems;
Network science;
Ontologies and semantics;
Security and privacy;
Cloud/edge/fog computing;
Green computing; and
Quantum computing.