Dataset Scaling and MapReduce Performance

2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum. Pub Date: 2013-05-20. DOI: 10.1109/IPDPSW.2013.143

Fan Zhang, M. Sakr

Citations: 9

Abstract

Predicting execution behavior of MapReduce applications when scaling the input dataset presents a challenging problem. The difficulty lies in the distributed locations of input data and the distributed, virtualized compute resources that utilize different network substrates. The potential payoff lies in using small datasets and limited test runs to understand how applications will behave with "big data." Our research has developed an in-depth understanding of MapReduce application performance and analyzed the impact of scaling input datasets. While we might expect that "embarrassingly parallel" MapReduce jobs should scale linearly with input dataset size, our results show that execution time sometimes increases nonlinearly. To verify our predictions, we identify a benchmark set of Map-, Shuffle-, and Reduce-intensive applications. Experimental results show that our execution-time analysis distinguishes four typical application behaviors when scaling input datasets.
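One simple way to check whether a job scales linearly from a handful of small test runs is to estimate the slope of log(execution time) against log(input size): a slope near 1 suggests linear scaling, and a slope clearly above 1 suggests superlinear growth. The sketch below is illustrative only and is not the analysis method of the paper. The function name `scaling_exponent` is hypothetical, and the timings are synthetic values generated from known formulas, not measurements.

```python
import math


def scaling_exponent(sizes, times):
    """Return the least-squares slope of log(time) against log(size).

    Hypothetical helper: a slope near 1.0 is consistent with linear
    scaling; a slope above 1.0 indicates superlinear growth.
    """
    xs = [math.log(s) for s in sizes]
    ys = [math.log(t) for t in times]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Synthetic inputs (assumptions for illustration, not measured data).
sizes_gb = [1, 2, 4, 8]
linear_times = [30.0 * s for s in sizes_gb]              # time proportional to size
superlinear_times = [30.0 * s ** 1.5 for s in sizes_gb]  # time grows as size^1.5

print(round(scaling_exponent(sizes_gb, linear_times), 3))       # 1.0
print(round(scaling_exponent(sizes_gb, superlinear_times), 3))  # 1.5
```

With real runs, the fitted slope would be noisy, so several repetitions per input size would be needed before drawing a conclusion about linearity.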

