Authors: Miwako Tsuji, S. Petiton, M. Sato
Venue: ESPM '15
Published: 2015-11-15
DOI: 10.1145/2832241.2832243
URL: https://doi.org/10.1145/2832241.2832243
Citations: 3
Fault tolerance features of a new multi-SPMD programming/execution environment
Abstract
Supercomputers in the exascale era are expected to consist of a huge number of nodes arranged in a multi-level hierarchy. Exploiting such systems raises many important challenges, including scalability, programmability, reliability, and energy efficiency. In our previous work, we focused on scalability and programmability and proposed FP2C (Framework for Post-Petascale Computing), a development and execution environment for parallel programming based on workflow and PGAS (Partitioned Global Address Space) programming models. In this paper, we focus on reliability. We extend FP2C by adding a fault detection capability to its middleware and by incorporating a fault-resilient scheduling policy into the workflow scheduler. With the extended FP2C, fault tolerance can be achieved without modifying applications.
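The abstract does not describe how the extended scheduler works internally. As a rough illustration of the general idea only (a workflow scheduler that reacts to a detected fault by rerunning the affected task elsewhere, so that application code stays unchanged), here is a minimal Python sketch. Every name in it (`NodeFailure`, `Task`, `run_workflow`, `max_attempts`) is hypothetical and is not part of FP2C; the retry-on-another-node policy is an assumption, not the policy the authors implemented.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


class NodeFailure(Exception):
    """Hypothetical signal raised when the middleware detects a node fault."""


@dataclass
class Task:
    name: str
    run: Callable[[str], object]  # application code; receives the node name
    deps: List[str] = field(default_factory=list)


def run_workflow(tasks: List[Task], nodes: List[str],
                 max_attempts: int = 3) -> Dict[str, object]:
    """Run tasks in dependency order; on a detected fault, retry on another node."""
    results: Dict[str, object] = {}
    healthy = list(nodes)                 # nodes still considered usable
    pending = {t.name: t for t in tasks}

    while pending:
        # A task is ready once all of its dependencies have produced results.
        ready = [t for t in pending.values()
                 if all(d in results for d in t.deps)]
        if not ready:
            raise RuntimeError("unsatisfiable dependencies")

        for task in ready:
            for _ in range(max_attempts):
                if not healthy:
                    raise RuntimeError("no healthy nodes left")
                node = healthy[0]
                try:
                    results[task.name] = task.run(node)
                    break
                except NodeFailure:
                    # Assumed policy: stop scheduling onto the failed node
                    # and resubmit the same task on the next healthy one.
                    healthy.remove(node)
            else:
                raise RuntimeError(
                    f"task {task.name} failed {max_attempts} times")
            del pending[task.name]

    return results
```

In this sketch the task functions contain no fault-handling logic: detection (the `NodeFailure` exception) and recovery (resubmission) both live in the scheduler, which mirrors the abstract's claim that fault tolerance is achieved without modifying applications.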