Fault tolerance features of a new multi-SPMD programming/execution environment

ESPM '15 Pub Date : 2015-11-15 DOI:10.1145/2832241.2832243
Miwako Tsuji, S. Petiton, M. Sato
Citations: 3

Abstract

Supercomputers in the exascale era are expected to consist of a huge number of nodes arranged in a multi-level hierarchy. Exploiting such systems poses many important challenges, including scalability, programmability, reliability, and energy efficiency. In our previous work, we focused on scalability and programmability and proposed FP2C (Framework for Post-Petascale Computing), a development and execution environment for parallel programming based on workflow and PGAS (Partitioned Global Address Space) programming models. In this paper, we focus on reliability. We extend FP2C by adding a fault detection capability to its middleware and by incorporating a fault-resilient scheduling policy into the workflow scheduler. With the extended FP2C, fault tolerance can be achieved without modifying applications.
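The abstract does not describe the FP2C interfaces, so the following is only a minimal sketch of the general idea it names: a workflow scheduler that reacts to faults reported by a lower layer, so that task code itself stays unchanged. Every name here (`NodeFailure`, `run_workflow`, `execute`, the node labels) is hypothetical and not taken from FP2C. For brevity the sketch runs tasks sequentially and uses a simple "drop the failed node and retry elsewhere" policy, which is an assumption and not necessarily the policy used in the paper.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


class NodeFailure(Exception):
    """Hypothetical signal raised when a fault-detecting layer reports a dead node."""


def run_workflow(deps, nodes, execute, max_retries=3):
    """Run tasks in dependency order, re-scheduling a task after a node fault.

    deps:    dict mapping each task to the tasks it depends on
    nodes:   list of node identifiers (hypothetical labels)
    execute: callable(task, node) that may raise NodeFailure
    Returns a dict mapping each task to the node that completed it.
    """
    healthy = list(nodes)
    placement = {}
    for task in TopologicalSorter(deps).static_order():
        for _ in range(max_retries + 1):
            if not healthy:
                raise RuntimeError("no healthy nodes left")
            node = healthy[0]
            try:
                execute(task, node)
            except NodeFailure:
                # Assumed policy: stop using the failed node and retry the task.
                healthy.remove(node)
                continue
            placement[task] = node
            break
        else:
            raise RuntimeError(f"task {task!r} exceeded the retry limit")
    return placement


def execute(task, node):
    # Toy stand-in for real task execution: node "n0" always fails.
    if node == "n0":
        raise NodeFailure(node)


placement = run_workflow(
    {"a": [], "b": ["a"], "c": ["a", "b"]},
    ["n0", "n1", "n2"],
    execute,
)
```

In this toy run the task function never handles the fault itself; recovery happens entirely in the scheduler loop, which mirrors the abstract's claim that fault tolerance can be obtained without modifying applications.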