Better Safe than Sorry: Grappling with Failures of In-Memory Data Analytics Frameworks

Proceedings of the 26th International Symposium on High-Performance Parallel and Distributed Computing. Pub Date: 2017-06-26. DOI: 10.1145/3078597.3078600

Bogdan Ghit, D. Epema


Citations: 17

Abstract

Providing fault tolerance is of major importance for data analytics frameworks such as Hadoop and Spark, which are typically deployed in large clusters that are known to experience high failure rates. Unexpected events such as compute node failures are a particularly important challenge for in-memory data analytics frameworks, as the widely adopted approach to dealing with them is to recompute work already done. Recomputing lost work, however, requires allocating extra resources to re-execute tasks, thus increasing job runtimes. To address this problem, we design a checkpointing system called Panda that is tailored to the intrinsic characteristics of data analytics frameworks. In particular, Panda employs fine-grained checkpointing at the level of task outputs and dynamically identifies tasks that are worth checkpointing rather than recomputing. As has been abundantly shown, tasks of data analytics jobs may have highly variable runtimes and output sizes. These properties form the basis of three checkpointing policies which we incorporate into Panda. We first empirically evaluate Panda on a multicluster system with single data analytics applications under space-correlated failures, and find that Panda comes close to the performance of a failure-free execution in unmodified Spark for a large range of concurrent failures. Then we perform simulations of complete workloads, mimicking the size and operation of a Google cluster, and show that Panda provides significant improvements in the average job runtime for wide ranges of the failure rate and system load.




Source journal

Proceedings of the 26th International Symposium on High-Performance Parallel and Distributed Computing

Self-citation rate

0.00%

Publication count