{"title":"Techniques to improve the scalability of collective checkpointing at large scale","authors":"Bogdan Nicolae","doi":"10.1109/HPCSim.2015.7237113","DOIUrl":null,"url":null,"abstract":"Scientific and data-intensive computing have matured over the last couple of years in all fields of science and industry. Their rapid increase in complexity and scale has prompted ongoing efforts dedicated to reach exascale infrastructure capability by the end of the decade. However, advances in this context are not homogeneous: I/O capabilities in terms of networking and storage are lagging behind computational power and are often considered a major limitation that that persists even at petascale [1]. A particularly difficult challenge in this context are collective I/O access patterns (which we henceforth refer to as collective checkpointing) where all processes simultaneously dump large amounts of related data simultaneously to persistent storage. This pattern is often exhibited by large-scale, bulk-synchronous applications in a variety of circumstances, e.g., when they use checkpoint-restart fault tolerance techniques to save intermediate computational states at regular time intervals [2] or when intermediate, globally synchronized results are needed during the lifetime of the computation (e.g. to understand how a simulation progresses during key phases). Under such circumstances, a decoupled storage system (e.g. a parallel file system such as GPFS [3] or a specialized storage system such as BlobSeer [4]) does not provide sufficient I/O bandwidth to handle the explosion of data sizes: for example, Jones et al. [5] predict dump times in the order of several hours. In order to overcome the I/O bandwidth limitation, one potential solution is to equip the compute nodes with local storage (i.e., HDDs, SSDs, NVMs, etc.) or use I/O forwarding nodes. Using this approach, a large part of the data can be dumped locally, which completely avoids the need to consume and compete for the I/O bandwidth of a decoupled storage system. However, this is not without drawbacks: the local storage devices or I/O forwarding nodes are prone to failures and as such the data they hold is volatile. Thus, a popular approach in practice is to wait until the local dump has finished, then let the application continue while the checkpoints are in turn dumped to a parallel file system in background. Such a straightforward solution can be effective at hiding the overhead incurred to due I/O bandwidth limitations, but this not necessarily the case: it may happen that there is not enough time to fully flush everything to the parallel file system before the next collective checkpoint request is issued. In fact, this a likely scenario with growing scale, as the failure rate increases, which introduces the need to checkpoint at smaller intervals in order to compensate for this effect. 
Furthermore, a smaller checkpoint interval also means local dumps are frequent and as such their overhead becomes significant itself.","PeriodicalId":134009,"journal":{"name":"2015 International Conference on High Performance Computing & Simulation (HPCS)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2015-07-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2015 International Conference on High Performance Computing & Simulation (HPCS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/HPCSim.2015.7237113","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Scientific and data-intensive computing have matured over the last few years in all fields of science and industry. Their rapid increase in complexity and scale has prompted ongoing efforts dedicated to reaching exascale infrastructure capability by the end of the decade. However, advances in this context are not homogeneous: I/O capabilities in terms of networking and storage are lagging behind computational power and are often considered a major limitation that persists even at petascale [1]. A particularly difficult challenge in this context is posed by collective I/O access patterns (which we henceforth refer to as collective checkpointing), where all processes simultaneously dump large amounts of related data to persistent storage. This pattern is often exhibited by large-scale, bulk-synchronous applications in a variety of circumstances, e.g., when they use checkpoint-restart fault tolerance techniques to save intermediate computational states at regular time intervals [2], or when intermediate, globally synchronized results are needed during the lifetime of the computation (e.g., to understand how a simulation progresses during key phases). Under such circumstances, a decoupled storage system (e.g., a parallel file system such as GPFS [3] or a specialized storage system such as BlobSeer [4]) does not provide sufficient I/O bandwidth to handle the explosion of data sizes: for example, Jones et al. [5] predict dump times on the order of several hours. To overcome the I/O bandwidth limitation, one potential solution is to equip the compute nodes with local storage (e.g., HDDs, SSDs, NVMs) or to use I/O forwarding nodes. With this approach, a large part of the data can be dumped locally, which completely avoids the need to consume and compete for the I/O bandwidth of a decoupled storage system. However, this is not without drawbacks: the local storage devices or I/O forwarding nodes are prone to failures, and as such the data they hold is volatile. Thus, a popular approach in practice is to wait until the local dump has finished, then let the application continue while the checkpoints are in turn flushed to a parallel file system in the background. Such a straightforward solution can be effective at hiding the overhead incurred due to I/O bandwidth limitations, but this is not necessarily the case: there may not be enough time to fully flush everything to the parallel file system before the next collective checkpoint request is issued. In fact, this is a likely scenario at growing scale: as the failure rate increases, applications need to checkpoint at smaller intervals to compensate. Furthermore, a smaller checkpoint interval also means local dumps become more frequent, so their own overhead becomes significant.
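As a back-of-the-envelope check on the closing claim (higher failure rates force smaller checkpoint intervals), one can appeal to Young's classic first-order approximation, which is not derived in the abstract itself but is standard in the checkpoint-restart literature. With C the cost of taking one checkpoint and M the mean time between failures, the interval that minimizes expected lost work is approximately

    \tau_{\mathrm{opt}} \approx \sqrt{2\,C\,M}

so as M shrinks with growing node counts, \tau_{\mathrm{opt}} shrinks with it, and the fixed per-checkpoint cost C (here, the local dump) is amortized over less and less useful computation.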
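To make the two-level scheme described above concrete, the following is a minimal sketch in C of a synchronous local dump followed by an asynchronous background flush to the parallel file system. It assumes node-local storage mounted at /local/ssd and a parallel file system at /gpfs/ckpt; the paths, the checkpoint() signature, and the single-flusher-thread design are illustrative assumptions, not the paper's actual implementation.

/* Sketch of two-level checkpointing: blocking local dump, background
 * flush to the parallel file system. Paths are assumptions. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define LOCAL_DIR "/local/ssd"   /* node-local storage (assumed mount point) */
#define PFS_DIR   "/gpfs/ckpt"   /* parallel file system (assumed mount point) */

static pthread_t flusher;
static int flush_in_progress = 0;

/* Background thread: copy the finished local dump to the parallel file
 * system while the application keeps computing. */
static void *flush_to_pfs(void *arg)
{
    char *name = arg;
    char src[256], dst[256], buf[1 << 16];
    snprintf(src, sizeof src, "%s/%s", LOCAL_DIR, name);
    snprintf(dst, sizeof dst, "%s/%s", PFS_DIR, name);
    FILE *in = fopen(src, "rb"), *out = fopen(dst, "wb");
    if (in && out) {
        size_t n;
        while ((n = fread(buf, 1, sizeof buf, in)) > 0)
            fwrite(buf, 1, n, out);
    }
    if (in) fclose(in);
    if (out) fclose(out);
    free(name);
    return NULL;
}

/* Called at each checkpoint interval. The local dump is synchronous and
 * cheap; the flush to the PFS runs in the background. */
void checkpoint(const void *state, size_t size, int step)
{
    /* If the previous flush has not finished, the application stalls
     * here -- exactly the scenario the abstract warns about when the
     * checkpoint interval shrinks at scale. */
    if (flush_in_progress)
        pthread_join(flusher, NULL);

    char *name = malloc(64);
    snprintf(name, 64, "ckpt-%d.dat", step);
    char path[256];
    snprintf(path, sizeof path, "%s/%s", LOCAL_DIR, name);
    FILE *f = fopen(path, "wb");
    if (f) {
        fwrite(state, 1, size, f);   /* blocking local dump */
        fclose(f);
    }

    pthread_create(&flusher, NULL, flush_to_pfs, name);
    flush_in_progress = 1;
}

In this sketch, the pthread_join call is where the hazard described in the abstract materializes: if the background flush takes longer than the checkpoint interval, the application blocks on the previous flush and the overhead is no longer hidden.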