Optimization Techniques within the Hadoop Eco-system: A Survey
Giulia Rumi, Claudia Colella, D. Ardagna
2014 16th International Symposium on Symbolic and Numeric Algorithms for Scientific Computing, 2014-09-01
DOI: 10.1109/SYNASC.2014.65 (https://doi.org/10.1109/SYNASC.2014.65)
Citations: 7
Abstract
We live in a digital world that produces data at an impressive speed: data are large, change quickly, and are often too complex to be processed by existing tools. The problem is to extract knowledge from all these data efficiently. MapReduce is a data-parallel programming model for clusters of commodity machines that was created to address this problem. In this paper we provide an overview of the Hadoop ecosystem. We introduce the most significant approaches supporting automatic, on-line resource provisioning. Moreover, we analyse optimization approaches proposed in frameworks built on top of MapReduce, such as Pig and Hive, which point out the importance of scheduling techniques in MapReduce when multiple workflows are executed concurrently. Therefore, the default Hadoop schedulers are discussed along with some enhancements proposed by the research community. The analysis highlights how research contributions try to address common Hadoop weaknesses. As our comparison shows, none of the frameworks surpasses the others, and a fair evaluation is difficult to perform. The choice of framework must be driven by the specific application goal, since no single solution addresses all the issues typical of MapReduce.