Team-Based Message Logging: Preliminary Results

Esteban Meneses, C. Mendes, L. Kalé
{"title":"Team-Based Message Logging: Preliminary Results","authors":"Esteban Meneses, C. Mendes, L. Kalé","doi":"10.1109/CCGRID.2010.110","DOIUrl":null,"url":null,"abstract":"Fault tolerance will be a fundamental imperative in the next decade as machines containing hundreds of thousands of cores will be installed at various locations. In this context, the traditional checkpoint/restart model does not seem to be a suitable option, since it makes all the processors roll back to their latest checkpoint in case of a single failure in one of the processors. In-memory message logging is an alternative that avoids this global restoration process and instead replays the messages to the failed processor. However, there is a large memory overhead associated with message logging because each message must be logged so it can be played back if a failure occurs. In this paper, we introduce a technique to alleviate the demand of memory in message logging by grouping processors into teams. These teams act as a failure unit: if one team member fails, all the other members in that team roll back to their latest checkpoint and start the recovery process. This eliminates the need to log message contents within teams. The savings in memory produced by this approach depend on the characteristics of the application, the number of messages sent per computation unit and size of those messages. We present promising results for multiple benchmarks. As an example, the NPB-CG code running class D on 512 cores manages to reduce the memory overhead of message logging by 62%.","PeriodicalId":444485,"journal":{"name":"2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing","volume":"144 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2010-05-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"39","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CCGRID.2010.110","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 39

Abstract

Fault tolerance will be a fundamental imperative in the next decade as machines containing hundreds of thousands of cores will be installed at various locations. In this context, the traditional checkpoint/restart model does not seem to be a suitable option, since it makes all the processors roll back to their latest checkpoint in case of a single failure in one of the processors. In-memory message logging is an alternative that avoids this global restoration process and instead replays the messages to the failed processor. However, there is a large memory overhead associated with message logging because each message must be logged so it can be played back if a failure occurs. In this paper, we introduce a technique to alleviate the demand of memory in message logging by grouping processors into teams. These teams act as a failure unit: if one team member fails, all the other members in that team roll back to their latest checkpoint and start the recovery process. This eliminates the need to log message contents within teams. The savings in memory produced by this approach depend on the characteristics of the application, the number of messages sent per computation unit and size of those messages. We present promising results for multiple benchmarks. As an example, the NPB-CG code running class D on 512 cores manages to reduce the memory overhead of message logging by 62%.
基于团队的消息记录:初步结果
在未来十年,容错将是一个基本的必要条件,因为包含数十万个核心的机器将被安装在不同的地方。在这种情况下,传统的检查点/重新启动模型似乎不是一个合适的选择,因为如果其中一个处理器出现单个故障,它会使所有处理器回滚到最新的检查点。内存中消息日志记录是一种替代方法,它可以避免这种全局恢复过程,而是将消息重放到故障处理器。但是,与消息日志记录相关的内存开销很大,因为必须记录每条消息,以便在发生故障时回放。在本文中,我们介绍了一种通过分组处理器来减少消息日志对内存需求的技术。这些团队充当一个故障单元:如果一个团队成员失败,该团队中的所有其他成员回滚到他们最近的检查点并开始恢复过程。这消除了在团队中记录消息内容的需要。这种方法所节省的内存取决于应用程序的特征、每个计算单元发送的消息数量以及这些消息的大小。我们在多个基准测试中展示了有希望的结果。例如,在512核上运行类D的NPB-CG代码设法将消息日志的内存开销减少62%。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信