Understanding the Effects of Communication and Coordination on Checkpointing at Scale

SC14: International Conference for High Performance Computing, Networking, Storage and Analysis. Pub Date: 2014-11-16. DOI: 10.1109/SC.2014.77

Kurt B. Ferreira, Patrick M. Widener, Scott Levy, D. Arnold, T. Hoefler


Citations: 24

Abstract

Fault tolerance poses a major challenge for future large-scale systems. Active research into coordinated, uncoordinated, and hybrid checkpointing systems has explored how the introduction of asynchrony can address anticipated scalability issues. However, few insights into the selection and tuning of these protocols for applications at scale have emerged. In this paper, we use a simulation-based approach to show that local checkpoint activity in resilience mechanisms can significantly affect the performance of key workloads, even when less than 1% of a local node's compute time is allocated to resilience mechanisms (a very generous assumption). Specifically, we show that even though much work on uncoordinated checkpointing has focused on optimizing message log volumes, local checkpointing activity may dominate the overheads of this technique at scale. Our study shows that local checkpoints lead to process delays that can propagate through messaging relations to other processes, causing a cascading series of delays. We demonstrate how to tune hierarchical uncoordinated checkpointing protocols, which were designed to reduce log volumes, so that they also significantly reduce these synchronization overheads at scale. Our work provides a critical analysis and comparison of coordinated and uncoordinated checkpointing and enables users and system administrators to fine-tune the checkpointing scheme to the application and system characteristics.
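The delay propagation described in the abstract can be illustrated with a toy model. The sketch below is not the authors' simulator; it is a minimal illustration under assumptions chosen here: processes sit on a ring, each step consists of a fixed compute phase followed by a nearest-neighbor exchange, and a checkpoint adds a fixed local cost. The function name `simulate` and all parameter values are hypothetical. The staggered schedule is deliberately a worst case for this communication pattern.

```python
def simulate(num_procs, steps, compute, ckpt_cost, interval, offsets):
    """Return the finish time of the slowest process.

    Toy model (an assumption of this sketch, not the paper's method):
    a process may start step k only after it and both ring neighbors
    have finished step k-1, which models a blocking neighbor exchange.
    Process r takes a local checkpoint on steps where
    (step - offsets[r]) is a multiple of `interval`.
    """
    finish = [0.0] * num_procs
    for step in range(steps):
        next_finish = []
        for rank in range(num_procs):
            left = finish[(rank - 1) % num_procs]
            right = finish[(rank + 1) % num_procs]
            # Waiting on neighbors is how one process's delay spreads.
            start = max(finish[rank], left, right)
            cost = compute
            if (step - offsets[rank]) % interval == 0:
                cost += ckpt_cost  # local checkpoint on this step
            next_finish.append(start + cost)
        finish = next_finish
    return max(finish)


if __name__ == "__main__":
    procs, steps, interval = 20, 100, 10
    compute, ckpt = 1.0, 0.5

    # Coordinated: every process checkpoints on the same steps.
    coordinated = simulate(procs, steps, compute, ckpt, interval,
                           offsets=[0] * procs)
    # Uncoordinated, staggered: each rank checkpoints on a different step.
    staggered = simulate(procs, steps, compute, ckpt, interval,
                         offsets=[r % interval for r in range(procs)])

    print("coordinated:", coordinated)
    print("staggered:  ", staggered)
```

In this toy setup every process spends the same total time checkpointing under both schedules, yet the staggered schedule finishes later because some neighbor is always checkpointing and each delay is inherited by the processes that wait on it. The size of the gap depends entirely on the assumed communication pattern and schedule, so it should not be read as a result from the paper.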



