Statistics-based Workload Modeling for MapReduce
Hailong Yang, Zhongzhi Luan, Wenjun Li, D. Qian, Gang Guan
2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum
Published: 2012-05-21
DOI: 10.1109/IPDPSW.2012.254 (https://doi.org/10.1109/IPDPSW.2012.254)
Citations: 24
Abstract
Large-scale data-intensive computing with the MapReduce framework in the cloud is becoming pervasive in the core business of many academic, government, and industrial organizations. Hadoop is by far the most successful realization of the MapReduce framework. While MapReduce is easy to use, efficient, and reliable for data-intensive computations, the large number of configuration parameters in Hadoop makes it unexpectedly challenging to run diverse workloads effectively on a Hadoop cluster. Consequently, developers with little experience of the Hadoop configuration system may devote significant effort to writing an application that still performs poorly, either because they do not know how these configurations influence performance or because they are not even aware that the configurations exist. In this paper, we propose a statistical analysis approach to identify the relationships among workload characteristics, Hadoop configurations, and workload performance. Several non-intuitive relationships between workload characteristics and relative performance are revealed, and the experimental results demonstrate that our regression models accurately predict the performance of MapReduce workloads under different Hadoop configurations.