Simulating Hive Cluster for Deployment Planning, Evaluation and Optimization

2014 IEEE 6th International Conference on Cloud Computing Technology and Science. Pub Date: 2014-12-15. DOI: 10.1109/CloudCom.2014.119

Kebing Wang, Zhaojuan Bian, Qian Chen, Ren Wang, Gen Xu


Citations: 10

Abstract

In the era of big data, Hive has quickly gained popularity for its superior capability to manage and analyze very large datasets, both structured and unstructured, residing in distributed storage systems. However, great opportunity comes with great challenges: Hive query performance is affected by many factors, which makes capacity planning and tuning for a Hive cluster extremely difficult. These factors include the system software stack (Hive, MapReduce framework, JVM, and OS), the cluster hardware configuration (processor, memory, storage, and network), and Hive data models and distributions. Current planning methods rely mostly on trial and error or on very high-level estimation. These approaches are neither efficient nor accurate, especially given increasing software stack complexity, hardware diversity, and the unavoidable data skew in distributed database systems. In this paper, we propose a Hive simulation framework based on CSMethod, which simulates the whole Hive query execution life cycle, including query plan generation and MapReduce task execution. The framework is validated using typical query operations under varying hardware, software, and workload parameters, showing high accuracy and fast simulation speed. We also demonstrate the application of this framework with two real-world use cases: helping customers perform capacity planning and estimate business query response time before system provisioning.
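To make the capacity-planning use case concrete, the sketch below shows the kind of question such a simulator answers: how many nodes does a query need to meet a response-time target? It is not the CSMethod framework described in the abstract. It is a toy wave-based model of a single MapReduce job, and every function name, parameter, and number in it is a hypothetical assumption chosen for illustration. It ignores data skew, shuffle cost, and hardware contention, which are exactly the effects a full simulation is meant to capture.

```python
import math


def estimate_stage_seconds(num_tasks, slots, seconds_per_task):
    """Toy model (assumption): tasks run in waves, each wave fills all slots."""
    if num_tasks == 0:
        return 0.0
    waves = math.ceil(num_tasks / slots)
    return waves * seconds_per_task


def estimate_job_seconds(input_bytes, split_bytes, nodes,
                         map_slots_per_node, reduce_tasks,
                         reduce_slots_per_node,
                         map_task_seconds, reduce_task_seconds):
    """Estimate one MapReduce job as a map stage followed by a reduce stage.

    All per-task durations are hypothetical inputs; a real simulator would
    derive them from hardware, software, and data-distribution models.
    """
    map_tasks = math.ceil(input_bytes / split_bytes)
    map_time = estimate_stage_seconds(
        map_tasks, nodes * map_slots_per_node, map_task_seconds)
    reduce_time = estimate_stage_seconds(
        reduce_tasks, nodes * reduce_slots_per_node, reduce_task_seconds)
    return map_time + reduce_time


def smallest_cluster(target_seconds, max_nodes, **job):
    """Return the smallest node count meeting the target, or None."""
    for nodes in range(1, max_nodes + 1):
        if estimate_job_seconds(nodes=nodes, **job) <= target_seconds:
            return nodes
    return None


if __name__ == "__main__":
    MB = 2 ** 20
    job = dict(
        input_bytes=100 * 128 * MB,   # hypothetical table size: 100 splits
        split_bytes=128 * MB,
        map_slots_per_node=5,
        reduce_tasks=8,
        reduce_slots_per_node=2,
        map_task_seconds=10,
        reduce_task_seconds=30,
    )
    print(estimate_job_seconds(nodes=4, **job))
    print(smallest_cluster(target_seconds=60, max_nodes=20, **job))
```

Under these assumed inputs, sweeping the node count and re-estimating is the same loop a planner would run against a real simulator before provisioning hardware; only the estimator itself would be replaced by a far more detailed model.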
