Optimization Techniques within the Hadoop Eco-system: A Survey

2014 16th International Symposium on Symbolic and Numeric Algorithms for Scientific Computing. Pub Date: 2014-09-01. DOI: 10.1109/SYNASC.2014.65

Giulia Rumi, Claudia Colella, D. Ardagna


Citations: 7

Abstract

We live in a digital world that produces data at an impressive speed: data are large, change quickly, and are often too complex to be processed by existing tools. The problem is to extract knowledge from all these data efficiently. MapReduce is a data-parallel programming model for clusters of commodity machines that was created to address this problem. In this paper we provide an overview of the Hadoop ecosystem. We introduce the most significant approaches supporting automatic, on-line resource provisioning. Moreover, we analyse optimization approaches proposed in frameworks built on top of MapReduce, such as Pig and Hive, which point out the importance of scheduling techniques in MapReduce when multiple workflows are executed concurrently. Therefore, the default Hadoop schedulers are discussed along with some enhancements proposed by the research community. The analysis highlights how research contributions try to address common Hadoop weaknesses. Our comparison shows that none of the frameworks surpasses the others, and a fair evaluation is difficult to perform. The choice of framework must be tied to the specific application goal, and no single solution addresses all the issues typical of MapReduce.
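For readers unfamiliar with the data-parallel model the abstract refers to, the following is a minimal single-process sketch of the map, shuffle, and reduce phases, using word counting as the example. It is an illustration only and is not taken from the paper: the function names (`map_words`, `reduce_counts`, `run_mapreduce`) and the sample input are assumptions, and a real Hadoop job would distribute these phases across cluster nodes rather than run them in one loop.

```python
from collections import defaultdict
from typing import Callable, Iterable, Iterator


def map_words(line: str) -> Iterator[tuple[str, int]]:
    """Map phase: emit a (word, 1) pair for every word in one input record."""
    for word in line.lower().split():
        yield word, 1


def reduce_counts(key: str, values: Iterable[int]) -> tuple[str, int]:
    """Reduce phase: combine all values emitted for one key."""
    return key, sum(values)


def run_mapreduce(
    records: Iterable[str],
    mapper: Callable[[str], Iterator[tuple[str, int]]],
    reducer: Callable[[str, Iterable[int]], tuple[str, int]],
) -> dict[str, int]:
    """Hypothetical in-memory driver: map, group by key (shuffle), then reduce."""
    groups: dict[str, list[int]] = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)  # shuffle: collect values per key
    # Keys are processed in sorted order, mirroring the sorted reducer input.
    return dict(reducer(key, values) for key, values in sorted(groups.items()))


# Illustrative input, not data from the paper.
lines = ["pig hive pig", "hive hadoop"]
counts = run_mapreduce(lines, map_words, reduce_counts)
```

Because the mapper handles each record independently and the reducer handles each key independently, both phases can in principle be spread over many machines, which is the property the schedulers and provisioning approaches surveyed in the paper aim to exploit.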
