Fault tolerance features of a new multi-SPMD programming/execution environment
Miwako Tsuji, S. Petiton, M. Sato
ESPM '15, published 2015-11-15
DOI: 10.1145/2832241.2832243 (https://doi.org/10.1145/2832241.2832243)
Citations: 3
Abstract
Supercomputers in the exascale era will consist of a huge number of nodes arranged in a multi-level hierarchy. Exploiting such systems poses many important challenges, including scalability, programmability, reliability, and energy efficiency. In our previous work, we focused on scalability and programmability. We proposed FP2C (Framework for Post-Petascale Computing), a development and execution environment for parallel programming based on the workflow and PGAS (Partitioned Global Address Space) programming models. In this paper, we focus on reliability. We extend FP2C by adding a fault detection capability to its middleware and by incorporating a fault-resilient scheduling policy into its workflow scheduler. With the extended FP2C, fault tolerance can be achieved without modifying applications.
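The idea that a workflow scheduler can absorb faults on behalf of unmodified tasks can be illustrated with a small sketch. This is not FP2C's implementation or API: the names `TaskFault`, `run_workflow`, `make_flaky`, and the `max_retries` parameter are hypothetical, fault detection is simulated by an exception, and the policy shown (re-run the faulted task a bounded number of times) is only one plausible resilience policy, assumed here for illustration.

```python
from collections import deque


class TaskFault(Exception):
    """Hypothetical signal standing in for a fault detected by the middleware."""


def run_workflow(tasks, deps, max_retries=3):
    """Run tasks in dependency order, re-running any task that faults.

    tasks: dict mapping task name -> zero-argument callable
    deps:  dict mapping task name -> list of prerequisite task names
    Returns a dict mapping task name -> number of attempts used.
    """
    # Prerequisites still outstanding for each task.
    remaining = {name: set(deps.get(name, ())) for name in tasks}
    ready = deque(sorted(n for n, d in remaining.items() if not d))
    attempts = {}
    done = set()
    while ready:
        name = ready.popleft()
        attempts[name] = attempts.get(name, 0) + 1
        try:
            tasks[name]()
        except TaskFault:
            if attempts[name] > max_retries:
                raise RuntimeError(
                    f"task {name!r} failed {attempts[name]} times"
                )
            # Reschedule the same task; the task code itself is unchanged.
            ready.append(name)
            continue
        done.add(name)
        # Release tasks whose last outstanding prerequisite just finished.
        for other, pending in remaining.items():
            if name in pending:
                pending.discard(name)
                if not pending and other not in done:
                    ready.append(other)
    if len(done) != len(tasks):
        raise RuntimeError("workflow has unsatisfiable dependencies")
    return attempts


def make_flaky(failures):
    """Build a task that raises TaskFault the first `failures` times it runs."""
    state = {"left": failures}

    def task():
        if state["left"] > 0:
            state["left"] -= 1
            raise TaskFault("simulated node failure")

    return task


if __name__ == "__main__":
    tasks = {"a": lambda: None, "b": make_flaky(2), "c": lambda: None}
    deps = {"b": ["a"], "c": ["a", "b"]}
    print(run_workflow(tasks, deps))
```

In this sketch the retry logic lives entirely in the scheduler loop, so the task callables carry no fault-handling code of their own, which mirrors the abstract's claim at a very small scale.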