RCMP: Enabling Efficient Recomputation Based Failure Resilience for Big Data Analytics

Florin Dinu, T. Ng
{"title":"RCMP: Enabling Efficient Recomputation Based Failure Resilience for Big Data Analytics","authors":"Florin Dinu, T. Ng","doi":"10.1109/IPDPS.2014.102","DOIUrl":null,"url":null,"abstract":"Data replication, the main failure resilience strategy used for big data analytics jobs, can be unnecessarily inefficient. It can cause serious performance degradation when applied to intermediate job outputs in multi-job computations. For instance, for I/O-intensive big data jobs, data replication is especially expensive because very large datasets need to be replicated. Reducing the number of replicas is not a satisfactory solution as it only aggravates a fundamental limitation of data replication: its failure resilience guarantees are limited by the number of available replicas. When all replicas of some piece of intermediate job output are lost, cascading job recomputations may be required for recovery. In this paper we show how job recomputation can be made a first-order failure resilience strategy for big data analytics. The need for data replication can thus be significantly reduced. We present RCMP, a system that performs efficient job recomputation. RCMP can persist task outputs across jobs and leverage them to minimize the work performed during job recomputations. More importantly, RCMP addresses two important challenges that appear during job recomputations. The first is efficiently utilizing the available compute node parallelism. The second is dealing with hot-spots. RCMP handles both by switching to a finer-grained task scheduling granularity for recomputations. Our experiments show that RCMP's benefits hold across two different clusters, for job inputs as small as 40GB or as large as 1.2TB. Compared to RCMP, data replication is 30%-100% worse during failure-free periods. More importantly, by efficiently performing recomputations, RCMP is comparable or better even under single and double data loss events.","PeriodicalId":309291,"journal":{"name":"2014 IEEE 28th International Parallel and Distributed Processing Symposium","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"19","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2014 IEEE 28th International Parallel and Distributed Processing Symposium","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/IPDPS.2014.102","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 19

Abstract

Data replication, the main failure resilience strategy used for big data analytics jobs, can be unnecessarily inefficient. It can cause serious performance degradation when applied to intermediate job outputs in multi-job computations. For instance, for I/O-intensive big data jobs, data replication is especially expensive because very large datasets need to be replicated. Reducing the number of replicas is not a satisfactory solution as it only aggravates a fundamental limitation of data replication: its failure resilience guarantees are limited by the number of available replicas. When all replicas of some piece of intermediate job output are lost, cascading job recomputations may be required for recovery. In this paper we show how job recomputation can be made a first-order failure resilience strategy for big data analytics. The need for data replication can thus be significantly reduced. We present RCMP, a system that performs efficient job recomputation. RCMP can persist task outputs across jobs and leverage them to minimize the work performed during job recomputations. More importantly, RCMP addresses two important challenges that appear during job recomputations. The first is efficiently utilizing the available compute node parallelism. The second is dealing with hot-spots. RCMP handles both by switching to a finer-grained task scheduling granularity for recomputations. Our experiments show that RCMP's benefits hold across two different clusters, for job inputs as small as 40GB or as large as 1.2TB. Compared to RCMP, data replication is 30%-100% worse during failure-free periods. More importantly, by efficiently performing recomputations, RCMP is comparable or better even under single and double data loss events.
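The abstract names two mechanisms: persisting a single copy of each task's output across jobs, and re-running lost work at a finer task granularity so that recomputation can use all available compute slots and avoid hot-spots. The Python sketch below illustrates that idea only; it is not RCMP's implementation. The `Job` and `recover` names, the partition model, and the slot-based choice of recomputation granularity are illustrative assumptions.

```python
# Illustrative sketch of recomputation-based recovery (not RCMP's code).
# Intermediate outputs are persisted once, not replicated. When some
# partitions are lost, only the tasks that produced them are re-run, and
# each lost partition is split into finer-grained recomputation tasks so
# the re-run can occupy every compute slot instead of a few nodes.

from dataclasses import dataclass, field


@dataclass
class Job:
    name: str
    num_tasks: int                              # task granularity of the normal run
    outputs: set = field(default_factory=set)   # partition ids currently available

    def run(self, partitions=None, granularity=1):
        """Run (or re-run) the job for the given partitions.

        granularity > 1 splits each partition into that many sub-tasks,
        modelling a finer scheduling granularity for recomputation.
        """
        todo = list(partitions) if partitions is not None else list(range(self.num_tasks))
        scheduled = []
        for p in todo:
            for sub in range(granularity):
                scheduled.append((p, sub))       # (partition, sub-task) to execute
            self.outputs.add(p)                  # persist a single copy, no replicas
        print(f"{self.name}: scheduled {len(scheduled)} tasks for partitions {sorted(todo)}")
        return scheduled


def recover(job, lost_partitions, slots):
    """Recompute only the partitions that were actually lost."""
    missing = [p for p in lost_partitions if p not in job.outputs]
    if not missing:
        return
    # Split the small amount of lost work finely enough to fill the cluster.
    granularity = max(1, slots // len(missing))
    job.run(partitions=missing, granularity=granularity)


if __name__ == "__main__":
    NUM_SLOTS = 8
    j1 = Job("job-1", num_tasks=8)
    j1.run()                        # failure-free run: one persisted copy per partition
    j1.outputs -= {3, 5}            # a failure wipes two partitions of intermediate output
    recover(j1, lost_partitions={3, 5}, slots=NUM_SLOTS)
```

The finer granularity matters because, without it, two lost partitions would be recomputed as two coarse tasks on two nodes, leaving the rest of the cluster idle and concentrating load on the recomputing nodes, which is the parallelism and hot-spot problem the abstract refers to.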