Team-Based Message Logging: Preliminary Results

2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing Pub Date : 2010-05-17 DOI:10.1109/CCGRID.2010.110

Esteban Meneses, C. Mendes, L. Kalé


Citations: 39

Abstract

Fault tolerance will be a fundamental requirement in the next decade, as machines containing hundreds of thousands of cores are installed at various locations. In this context, the traditional checkpoint/restart model does not seem to be a suitable option, since a single processor failure forces all processors to roll back to their latest checkpoint. In-memory message logging is an alternative that avoids this global rollback and instead replays messages to the failed processor. However, message logging carries a large memory overhead, because each message must be logged so that it can be replayed if a failure occurs. In this paper, we introduce a technique that reduces the memory demand of message logging by grouping processors into teams. Each team acts as a failure unit: if one team member fails, all other members of that team roll back to their latest checkpoint and start the recovery process. This eliminates the need to log message contents within teams. The memory savings produced by this approach depend on the characteristics of the application, the number of messages sent per computation unit, and the size of those messages. We present promising results for multiple benchmarks. As an example, the NPB-CG code running class D on 512 cores reduces the memory overhead of message logging by 62%.
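The team rule in the abstract can be illustrated with a small sketch. This is not the authors' implementation. It assumes a hypothetical mapping in which consecutive ranks form a team, and it models only the decision of whether a message payload must be kept in the sender's log. The helper names (`team_of`, `must_log_payload`, `logged_bytes`) and the ring traffic pattern are invented for illustration.

```python
def team_of(rank: int, team_size: int) -> int:
    """Hypothetical mapping: consecutive ranks share a team."""
    return rank // team_size


def must_log_payload(src: int, dest: int, team_size: int) -> bool:
    """Keep the payload only when the message crosses a team boundary.

    Members of one team roll back together, so a message sent inside
    a team does not need its contents stored for replay.
    """
    return team_of(src, team_size) != team_of(dest, team_size)


def logged_bytes(messages, team_size: int) -> int:
    """Sum payload bytes that must be logged for (src, dest, size) tuples."""
    return sum(
        size
        for src, dest, size in messages
        if must_log_payload(src, dest, team_size)
    )


# Illustrative traffic: 8 ranks in a ring, each sending 100 bytes
# to its right-hand neighbour.
ring = [(r, (r + 1) % 8, 100) for r in range(8)]

baseline = logged_bytes(ring, team_size=1)  # every processor is its own team
teamed = logged_bytes(ring, team_size=4)    # teams {0..3} and {4..7}
saving = 1 - teamed / baseline

print(f"baseline={baseline} B, teamed={teamed} B, saving={saving:.0%}")
```

In this toy pattern only the two messages that cross a team boundary (3 to 4 and 7 to 0) are logged, so the saving is much larger than it would be for an application with heavy cross-team traffic. As the abstract notes, the actual saving depends on the application's message count and message sizes.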

