Dynamic Load Balance for Optimized Message Logging in Fault Tolerant HPC Applications

2011 IEEE International Conference on Cluster Computing. Pub Date: 2011-09-26. DOI: 10.1109/CLUSTER.2011.39

Esteban Meneses, L. Kalé, G. Bronevetsky


Citations: 10

Abstract

Computing systems will grow significantly larger in the near future to satisfy the needs of computational scientists in areas like climate modeling, biophysics and cosmology. Supercomputers installed in the next few years will comprise millions of cores, hundreds of thousands of processor chips and millions of physical components. However, failures are expected to become more prevalent in those machines, to the point where 10% of an Exascale system will be wasted just recovering from failures. Further, with such large numbers of cores, fine-grained and dynamic load balancing will become increasingly critical for maintaining good system utilization. This paper addresses both fault tolerance and load balancing by presenting a novel extension of traditional message logging protocols based on team checkpointing. Message logging makes it possible to recover from localized failures by rolling back just the failed processing elements. Because logging all communication carries a high memory overhead, we reduce this cost by organizing processing elements into teams and logging only the messages sent between teams. Further, we show how to dynamically partition the application into teams so as to simultaneously minimize the cost of fault tolerance and balance application load. We experimentally show that this scheme has low overhead and can dramatically reduce the memory cost of message logging.
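
The memory saving from logging only inter-team messages can be illustrated with a small sketch. This is not the authors' implementation. The message trace, the team assignments, and the function name `logged_bytes` are all hypothetical, chosen only to show how grouping heavily communicating processing elements into one team shrinks the log.

```python
def logged_bytes(messages, team_of):
    """Return the bytes a team-based logger would store.

    messages: iterable of (sender, receiver, size_in_bytes) tuples.
    team_of:  dict mapping each processing element (PE) to its team id.
    Only messages that cross a team boundary are counted.
    """
    total = 0
    for sender, receiver, size in messages:
        if team_of[sender] != team_of[receiver]:
            total += size
    return total


# Hypothetical trace: PEs 0 and 1 talk heavily, as do PEs 2 and 3,
# with only light traffic between the two pairs.
messages = [
    (0, 1, 100), (1, 0, 100),
    (2, 3, 100), (3, 2, 100),
    (1, 2, 10), (0, 3, 10),
]

# Assumption: traditional message logging is modeled as one team per PE,
# so every message crosses a team boundary and is logged.
singleton_teams = {pe: pe for pe in range(4)}

# Assumption: teams are chosen to keep the heavy traffic inside a team.
paired_teams = {0: 0, 1: 0, 2: 1, 3: 1}

full_log = logged_bytes(messages, singleton_teams)
team_log = logged_bytes(messages, paired_teams)
print(full_log, team_log)
```

In this toy trace the team assignment keeps all heavy traffic inside a team, so only the two light cross-team messages are logged. A real partitioner would also have to weigh application load, as the abstract notes.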
