rDLB: A Novel Approach for Robust Dynamic Load Balancing of Scientific Applications with Independent Tasks

Authors: Ali Mohammed, Aurélien Cavelan, F. Ciorba
Venue: 2019 International Conference on High Performance Computing & Simulation (HPCS)
Published: 2019-07-01
DOI: 10.1109/HPCS48598.2019.9188153
Citations: 2
Abstract
Parallel scientific applications that execute on high performance computing (HPC) systems often contain large and computationally intensive parallel loops. The independent loop iterations of such applications represent independent tasks. Dynamic load balancing (DLB) is used to achieve a balanced execution of such applications. However, most of the self-scheduling-based techniques that are typically used to achieve DLB are not robust against component (e.g., processor, network) failures or perturbations that arise on large HPC systems. The self-scheduling-based techniques that do tolerate failures and/or perturbations rely on the existence of fault- and/or perturbation-detection mechanisms to trigger the rescheduling of tasks assigned to failed and/or perturbed components. This work proposes a novel robust dynamic load balancing (rDLB) approach for the robust self-scheduling of scientific applications with independent tasks on HPC systems under failures and/or perturbations. rDLB proactively reschedules already allocated tasks and requires no detection of failures or perturbations. Moreover, rDLB is integrated into an MPI-based DLB library. An analytical model of rDLB shows that, for a fixed problem size, the fault-tolerance overhead decreases linearly with the number of processors. The experimental evaluation shows that applications using rDLB tolerate up to P-1 worker processor failures (where P is the number of processors allocated to the application) and that their performance in the presence of perturbations improves by a factor of 7 compared to the case without rDLB. Moreover, the robustness of applications against perturbations (i.e., flexibility) is boosted by a factor of 30 using rDLB compared to the case without rDLB.
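The core idea described above — proactively re-issuing already allocated tasks once the task queue drains, so that no explicit failure detection is needed — can be illustrated with a minimal, hypothetical simulation. This sketch is not the paper's MPI implementation; the function name, data structures, and the way failed workers are modeled (they silently never report results) are assumptions made for illustration only.

```python
# Hypothetical sketch of rDLB-style proactive rescheduling in a
# self-scheduling loop. Workers in `failed` silently drop their tasks;
# the scheduler never detects this, it simply re-issues unfinished
# tasks from `in_flight` once the main queue is empty.
from collections import deque

def self_schedule(tasks, workers, failed=frozenset()):
    """Return {task: result}; completes as long as one worker is alive."""
    queue = deque(tasks)        # tasks not yet handed out
    in_flight = deque()         # handed out, no result seen yet
    done = {}
    while len(done) < len(tasks):
        # Round-robin over worker requests (stand-in for MPI messages).
        for w in workers:
            if len(done) == len(tasks):
                break
            if queue:
                t = queue.popleft()
            elif in_flight:
                t = in_flight.popleft()   # proactive rescheduling step
            else:
                continue
            if w in failed:
                # Result is lost, but the task stays in in_flight and
                # will be re-issued later -- no failure detection needed.
                in_flight.append(t)
            elif t not in done:
                done[t] = t * t           # stand-in for the real work
    return done

# Two of four workers fail; all 8 tasks still complete.
results = self_schedule(range(8), workers=[0, 1, 2, 3], failed={1, 2})
```

Duplicate completions (a rescheduled task finishing twice) are harmless here because only the first result for a task is recorded, mirroring the paper's claim that up to P-1 worker failures can be tolerated.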