Assessing Energy Efficiency of Fault Tolerance Protocols for HPC Systems

Esteban Meneses, O. Sarood, L. Kalé
{"title":"高性能计算系统容错协议的能效评估","authors":"Esteban Meneses, O. Sarood, L. Kalé","doi":"10.1109/SBAC-PAD.2012.12","DOIUrl":null,"url":null,"abstract":"An exascale machine is expected to be delivered in the time frame 2018-2020. Such a machine will be able to tackle some of the hardest computational problems and to extend our understanding of Nature and the universe. However, to make that a reality, the HPC community has to solve a few important challenges. Resilience will become a prominent problem because an exascale machine will experience frequent failures due to the large amount of components it will encompass. Some form of fault tolerance has to be incorporated in the system to maintain the progress rate of applications as high as possible. In parallel, the system will have to be more careful about power management. There are two dimensions of power. First, in a power-limited environment, all the layers of the system have to adhere to that limitation (including the fault tolerance layer). Second, power will be relevant due to energy consumption: an exascale installation will have to pay a large energy bill. It is fundamental to increase our understanding of the energy profile of different fault tolerance schemes. This paper presents an evaluation of three different fault tolerance approaches: checkpoint/restart, message-logging and parallel recovery. Using programs from different programming models, we show parallel recovery is the most energy-efficient solution for an execution with failures. At the same time, parallel recovery is able to finish the execution faster than the other approaches. We explore the behavior of these approaches at extreme scales using an analytical model. At large scale, parallel recovery is predicted to reduce the total execution time of an application by 17% and reduce the energy consumption by 13% when compared to checkpoint/restart.","PeriodicalId":232444,"journal":{"name":"2012 IEEE 24th International Symposium on Computer Architecture and High Performance Computing","volume":"3 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2012-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"47","resultStr":"{\"title\":\"Assessing Energy Efficiency of Fault Tolerance Protocols for HPC Systems\",\"authors\":\"Esteban Meneses, O. Sarood, L. Kalé\",\"doi\":\"10.1109/SBAC-PAD.2012.12\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"An exascale machine is expected to be delivered in the time frame 2018-2020. Such a machine will be able to tackle some of the hardest computational problems and to extend our understanding of Nature and the universe. However, to make that a reality, the HPC community has to solve a few important challenges. Resilience will become a prominent problem because an exascale machine will experience frequent failures due to the large amount of components it will encompass. Some form of fault tolerance has to be incorporated in the system to maintain the progress rate of applications as high as possible. In parallel, the system will have to be more careful about power management. There are two dimensions of power. First, in a power-limited environment, all the layers of the system have to adhere to that limitation (including the fault tolerance layer). Second, power will be relevant due to energy consumption: an exascale installation will have to pay a large energy bill. It is fundamental to increase our understanding of the energy profile of different fault tolerance schemes. 
This paper presents an evaluation of three different fault tolerance approaches: checkpoint/restart, message-logging and parallel recovery. Using programs from different programming models, we show parallel recovery is the most energy-efficient solution for an execution with failures. At the same time, parallel recovery is able to finish the execution faster than the other approaches. We explore the behavior of these approaches at extreme scales using an analytical model. At large scale, parallel recovery is predicted to reduce the total execution time of an application by 17% and reduce the energy consumption by 13% when compared to checkpoint/restart.\",\"PeriodicalId\":232444,\"journal\":{\"name\":\"2012 IEEE 24th International Symposium on Computer Architecture and High Performance Computing\",\"volume\":\"3 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2012-10-24\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"47\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2012 IEEE 24th International Symposium on Computer Architecture and High Performance Computing\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/SBAC-PAD.2012.12\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2012 IEEE 24th International Symposium on Computer Architecture and High Performance Computing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/SBAC-PAD.2012.12","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 47

Abstract

An exascale machine is expected to be delivered in the 2018-2020 time frame. Such a machine will be able to tackle some of the hardest computational problems and to extend our understanding of nature and the universe. However, to make that a reality, the HPC community has to solve a few important challenges. Resilience will become a prominent problem because an exascale machine, with the enormous number of components it will encompass, will experience frequent failures. Some form of fault tolerance has to be incorporated into the system to keep the progress rate of applications as high as possible. In parallel, the system will have to be more careful about power management. Power has two dimensions. First, in a power-limited environment, all the layers of the system, including the fault tolerance layer, have to adhere to that limitation. Second, power matters because of energy consumption: an exascale installation will have to pay a large energy bill. It is therefore fundamental to increase our understanding of the energy profile of different fault tolerance schemes.

This paper presents an evaluation of three fault tolerance approaches: checkpoint/restart, message logging, and parallel recovery. Using programs from different programming models, we show that parallel recovery is the most energy-efficient solution for an execution with failures. At the same time, parallel recovery finishes the execution faster than the other approaches. We explore the behavior of these approaches at extreme scales using an analytical model. At large scale, parallel recovery is predicted to reduce the total execution time of an application by 17% and its energy consumption by 13% compared to checkpoint/restart.
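The abstract refers to an analytical model of execution time at extreme scale. The authors' model is not reproduced here, but the Python sketch below shows the general shape of such first-order analyses: it uses Young's classic approximation for the optimal checkpoint interval, tau_opt = sqrt(2 * delta * M), and charges each failure a restart cost plus, on average, half an interval of lost work. The recovery_speedup parameter, the fixed-point iteration, and all numeric values are illustrative assumptions introduced here, not figures from the paper.

# A minimal first-order sketch (an illustration, not the authors' model) of
# expected runtime under failures for checkpoint/restart vs. parallel
# recovery. All parameter values are assumed for illustration.

import math

def expected_runtime(work, mtbf, ckpt_cost, restart_cost, recovery_speedup=1.0):
    """Expected total execution time in the presence of failures.

    work             -- failure-free compute time (s)
    mtbf             -- mean time between failures for the whole machine (s)
    ckpt_cost        -- time to write one checkpoint (s)
    restart_cost     -- time to detect a failure and restart (s)
    recovery_speedup -- factor by which lost work is re-executed faster;
                        1.0 models plain checkpoint/restart, >1.0 models
                        parallel recovery (lost work redone concurrently).
    """
    tau = math.sqrt(2.0 * ckpt_cost * mtbf)        # Young's optimal interval
    ckpt_overhead = (work / tau) * ckpt_cost       # time spent checkpointing
    base = work + ckpt_overhead
    # Each failure costs a restart plus, on average, half an interval of
    # lost work; parallel recovery shrinks the re-execution term.
    per_failure = restart_cost + (tau / 2.0) / recovery_speedup
    # Failures strike throughout the (longer) actual run, so solve
    # T = base + (T / mtbf) * per_failure by fixed-point iteration.
    total = base
    for _ in range(20):
        total = base + (total / mtbf) * per_failure
    return total

# Illustrative exascale-like numbers (assumptions): 24 h of useful work,
# 30 min system MTBF, 60 s checkpoints, 30 s restarts.
W, M, D, R = 24 * 3600.0, 1800.0, 60.0, 30.0
t_cr = expected_runtime(W, M, D, R, recovery_speedup=1.0)
t_pr = expected_runtime(W, M, D, R, recovery_speedup=8.0)
print(f"checkpoint/restart: {t_cr / 3600:.2f} h")
print(f"parallel recovery:  {t_pr / 3600:.2f} h "
      f"({100 * (1 - t_pr / t_cr):.1f}% faster)")

Under these assumed parameters the sketch reproduces the qualitative trend the abstract reports: re-executing lost work in parallel shortens a failure-prone run, and with energy roughly proportional to runtime at constant average power, a shorter run also means a lower energy bill.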