Adoption protocols for fanout-optimal fault-tolerant termination detection

ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming. Pub Date: 2013-02-23. DOI: 10.1145/2442516.2442519

J. Lifflander, P. Miller, L. Kalé


Citations: 11

Abstract

Termination detection is relevant for signaling completion (all processors are idle and no messages are in flight) of many operations in distributed systems, including work stealing algorithms, dynamic data exchange, and dynamically structured computations. In the face of growing supercomputers with increasing likelihood that each job may encounter faults, it is important for high-performance computing applications that rely on termination detection that such an algorithm be able to tolerate the inevitable faults. We provide a trio of new practical fault tolerance schemes for a standard approach to termination detection that are easy to implement, present low overhead in both theory and practice, and have scalable costs when recovering from faults. These schemes tolerate all single-process faults, and are probabilistically tolerant of faults affecting multiple processes. We combine the theoretical failure probabilities we can calculate for each algorithm with historical fault records from real machines to show that these algorithms have excellent overall survivability.
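The completion condition stated in the abstract (all processors idle, no messages in flight) can be written as a small predicate. The following is a minimal sketch, not the paper's protocol: `ProcessState`, its fields, and `is_terminated` are hypothetical names introduced here, and the check assumes a consistent global snapshot of every process, which a real distributed detector has to approximate with additional protocol rounds.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class ProcessState:
    """Hypothetical per-process record; the field names are assumptions."""
    idle: bool      # True when the process has no local work left
    sent: int       # total messages this process has sent
    received: int   # total messages this process has received


def is_terminated(states: Iterable[ProcessState]) -> bool:
    """Return True when every process is idle and no message is in flight.

    Assumes `states` is a consistent global snapshot. Counters gathered
    at different times can make this check report termination wrongly.
    """
    states = list(states)
    all_idle = all(s.idle for s in states)
    # Messages sent but not yet received are still in flight.
    in_flight = sum(s.sent for s in states) - sum(s.received for s in states)
    return all_idle and in_flight == 0
```

For example, two idle processes that have each received everything the other sent satisfy the predicate, while one outstanding message or one busy process does not.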

