Power-Check: An Energy-Efficient Checkpointing Framework for HPC Clusters

R. Rajachandrasekar, Akshay Venkatesh, Khaled Hamidouche, D. Panda
{"title":"Power-Check: An Energy-Efficient Checkpointing Framework for HPC Clusters","authors":"R. Rajachandrasekar, Akshay Venkatesh, Khaled Hamidouche, D. Panda","doi":"10.1109/CCGrid.2015.169","DOIUrl":null,"url":null,"abstract":"Checkpoint-restart is a predominantly used reactive fault-tolerance mechanism for applications running on HPC systems. While there are innumerable studies in literature that have analyzed, and optimized for, the performance and scalability of a variety of check pointing protocols, not much research has been done from an energy or power perspective. The limited number of studies conducted along this line have primarily analyzed and modeled power and energy usage during check pointing phases. Applications running on future exascale machines will be constrained by a power envelope, and it is not only important to understand the behavior of check pointing systems under such an envelope but to also adopt techniques that can leverage power capping capabilities exposed by the OS to achieve energy savings without forsaking performance. In this paper, we address the problem of marginal energy benefits with significant performance degradation due to naive application of power capping around check pointing phases by proposing a novel power-aware check pointing framework -- Power-Check. By use of data funnelling mechanisms and selective core power-capping, Power-Check makes efficient use of the I/O and CPU subsystem. Evaluations with application kernels show that Power-Check can yield as much as 48% reduction in the amount of energy consumed during a checkpoint, while improving the check pointing performance by 14%.","PeriodicalId":6664,"journal":{"name":"2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing","volume":"256 1","pages":"261-270"},"PeriodicalIF":0.0000,"publicationDate":"2015-05-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"28","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CCGrid.2015.169","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 28

Abstract

Checkpoint-restart is a predominantly used reactive fault-tolerance mechanism for applications running on HPC systems. While there are innumerable studies in literature that have analyzed, and optimized for, the performance and scalability of a variety of check pointing protocols, not much research has been done from an energy or power perspective. The limited number of studies conducted along this line have primarily analyzed and modeled power and energy usage during check pointing phases. Applications running on future exascale machines will be constrained by a power envelope, and it is not only important to understand the behavior of check pointing systems under such an envelope but to also adopt techniques that can leverage power capping capabilities exposed by the OS to achieve energy savings without forsaking performance. In this paper, we address the problem of marginal energy benefits with significant performance degradation due to naive application of power capping around check pointing phases by proposing a novel power-aware check pointing framework -- Power-Check. By use of data funnelling mechanisms and selective core power-capping, Power-Check makes efficient use of the I/O and CPU subsystem. Evaluations with application kernels show that Power-Check can yield as much as 48% reduction in the amount of energy consumed during a checkpoint, while improving the check pointing performance by 14%.
功率检查:高性能计算集群的节能检查点框架
检查点重新启动是一种主要用于运行在HPC系统上的应用程序的反应性容错机制。虽然文献中有无数的研究分析和优化了各种检查点协议的性能和可伸缩性,但从能源或电力的角度进行的研究并不多。沿着这条路线进行的有限数量的研究主要分析和模拟了检查点阶段的电力和能源使用情况。在未来的百亿亿级机器上运行的应用程序将受到功率包络的限制,不仅要了解在这种包络下检查点系统的行为,而且要采用可以利用操作系统暴露的功率上限功能的技术,在不牺牲性能的情况下实现节能,这一点很重要。在本文中,我们通过提出一种新颖的功率感知检查指向框架- power- check,解决了由于在检查指向阶段周围天真地应用功率封顶而导致的显著性能下降的边际能源效益问题。通过使用数据漏斗机制和选择性核心功率封顶,Power-Check可以有效地利用I/O和CPU子系统。对应用程序内核的评估表明,Power-Check可以在检查点期间减少多达48%的能耗,同时将检查点性能提高14%。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信