AI-Ckpt: leveraging memory access patterns for adaptive asynchronous incremental checkpointing

Bogdan Nicolae, F. Cappello
Published in: Proceedings of the 22nd International Symposium on High-Performance Parallel and Distributed Computing, 2013-06-17
DOI: 10.1145/2493123.2462918 (https://doi.org/10.1145/2493123.2462918)
Citations: 26

Abstract

With the increasing scale and complexity of supercomputing and cloud computing architectures, faults are becoming a frequent occurrence, which makes reliability a difficult challenge. Although for some applications it is enough to restart failed tasks, there is a large class of applications where tasks run for a long time or are tightly coupled, which makes a restart from scratch infeasible. Checkpoint-Restart (CR), the main method to survive failures for such applications, faces additional challenges in this context: not only does it need to minimize the performance overhead that checkpointing imposes on the application, but it also needs to operate with scarce resources. Given the iterative nature of the targeted applications, we assume that first-time writes to memory during asynchronous checkpointing generate the same kind of interference as they did in past iterations. Based on this assumption, we propose a novel asynchronous checkpointing approach that leverages both current and past access pattern trends to optimize the order in which memory pages are flushed to stable storage. Large-scale experiments show up to 60% improvement over state-of-the-art checkpointing approaches, achieved with an extra memory requirement of less than 5% of the total application memory.
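The ordering idea in the abstract can be illustrated with a small sketch. This is not the authors' algorithm, which the abstract does not spell out; it is a minimal illustration under stated assumptions. It assumes that each dirty page has an optional record of when it was first written during the previous checkpoint (the hypothetical `past_first_write` map), and that pages the application is currently blocked on (the hypothetical `waiting_now` set) should be flushed first. The function name `flush_order` and all parameters are invented for this example.

```python
from typing import Dict, Iterable, List


def flush_order(dirty_pages: Iterable[int],
                past_first_write: Dict[int, int],
                waiting_now: Iterable[int] = ()) -> List[int]:
    """Return the dirty pages in the order a background flusher might write them.

    Assumed heuristic (illustrative only, not taken from the paper):
      1. pages the application is waiting to write right now go first;
      2. then pages ordered by how early they were first written during
         the previous checkpoint (earlier first write = flush sooner);
      3. pages with no recorded history go last, ordered by page number.
    """
    waiting = set(waiting_now)
    never = float("inf")  # sort key for pages with no past first-write record

    def priority(page: int):
        # False sorts before True, so currently awaited pages come first.
        return (page not in waiting, past_first_write.get(page, never), page)

    return sorted(dirty_pages, key=priority)


# Example: page 13 is being waited on; pages 12 and 10 were written early
# in the previous checkpoint; page 11 has no recorded history.
order = flush_order(
    dirty_pages=[10, 11, 12, 13],
    past_first_write={12: 0, 10: 5, 13: 9},
    waiting_now=[13],
)
print(order)  # [13, 12, 10, 11]
```

Flushing the pages most likely to be written soon reduces the chance that the application touches a page before it has been saved, which is the kind of interference the abstract refers to. A real implementation would also need page-level write tracking and a bounded copy buffer, neither of which is modeled here.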