rDLB: A Novel Approach for Robust Dynamic Load Balancing of Scientific Applications with Independent Tasks

Authors: Ali Mohammed, Aurélien Cavelan, F. Ciorba
Venue: 2019 International Conference on High Performance Computing & Simulation (HPCS)
Published: 2019-07-01
DOI: 10.1109/HPCS48598.2019.9188153
Citations: 2
Abstract
Parallel scientific applications that execute on high performance computing (HPC) systems often contain large and computationally intensive parallel loops. The independent loop iterations of such applications represent independent tasks. Dynamic load balancing (DLB) is used to achieve a balanced execution of such applications. However, most of the self-scheduling-based techniques that are typically used to achieve DLB are not robust against component (e.g., processor, network) failures or perturbations that arise on large HPC systems. The self-scheduling-based techniques that do tolerate failures and/or perturbations rely on the existence of fault- and/or perturbation-detection mechanisms to trigger the rescheduling of tasks assigned to failed and/or perturbed components. This work proposes a novel robust dynamic load balancing (rDLB) approach for the robust self-scheduling of scientific applications with independent tasks on HPC systems under failures and/or perturbations. rDLB proactively reschedules already allocated tasks and requires no detection of failures or perturbations. Moreover, rDLB is integrated into an MPI-based DLB library. An analytical model of rDLB shows that, for a fixed problem size, the fault-tolerance overhead decreases linearly with the number of processors. The experimental evaluation shows that applications using rDLB tolerate up to P-1 worker processor failures (where P is the number of processors allocated to the application) and that their performance in the presence of perturbations improves by a factor of 7 compared to the case without rDLB. Moreover, the robustness of applications against perturbations (i.e., flexibility) is boosted by a factor of 30 using rDLB compared to the case without rDLB.
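The core idea described above — proactively re-issuing already allocated tasks once the task queue drains, so that no explicit failure detection is needed — can be illustrated with a minimal, hypothetical simulation. This sketch is not the paper's MPI implementation; the function name, data structures, and the way failed workers are modeled (they silently never report results) are assumptions made for illustration only.

```python
# Hypothetical sketch of rDLB-style proactive rescheduling in a
# self-scheduling loop. Workers in `failed` silently drop their tasks;
# the scheduler never detects this, it simply re-issues unfinished
# tasks from `in_flight` once the main queue is empty.
from collections import deque

def self_schedule(tasks, workers, failed=frozenset()):
    """Return {task: result}; completes as long as one worker is alive."""
    queue = deque(tasks)        # tasks not yet handed out
    in_flight = deque()         # handed out, no result seen yet
    done = {}
    while len(done) < len(tasks):
        # Round-robin over worker requests (stand-in for MPI messages).
        for w in workers:
            if len(done) == len(tasks):
                break
            if queue:
                t = queue.popleft()
            elif in_flight:
                t = in_flight.popleft()   # proactive rescheduling step
            else:
                continue
            if w in failed:
                # Result is lost, but the task stays in in_flight and
                # will be re-issued later -- no failure detection needed.
                in_flight.append(t)
            elif t not in done:
                done[t] = t * t           # stand-in for the real work
    return done

# Two of four workers fail; all 8 tasks still complete.
results = self_schedule(range(8), workers=[0, 1, 2, 3], failed={1, 2})
```

Duplicate completions (a rescheduled task finishing twice) are harmless here because only the first result for a task is recorded, mirroring the paper's claim that up to P-1 worker failures can be tolerated.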