Planning Your SQL-on-Hadoop Deployment Using a Low-Cost Simulation-Based Approach

2016 28th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD) Pub Date : 2016-10-01 DOI:10.1109/SBAC-PAD.2016.31

Jun Liu, Bianny Bian, Samantika Sury


Citations: 5

Abstract

The term "SQL-on-Hadoop" has recently gained significant traction [19]. Impala represents an emerging class of SQL-on-Hadoop systems that exploit a shared-nothing parallel database architecture over Hadoop. Impala was designed to close the gap in near-real-time data analytics on the Hadoop stack, and it has been shown to be significantly more efficient than other SQL-on-Hadoop solutions [13]. However, leveraging Impala to handle queries with different business demands is not a trivial task [12]: an improperly deployed Impala cluster may not deliver the expected performance. In this paper, we propose a novel Impala simulation framework to help IT professionals understand Impala's performance behavior, which simplifies the deployment planning required to enable big data analytics on SQL-on-Hadoop systems. The Impala simulator models the behavior of the complete software stack and simulates the activities of cluster components such as storage, network, processors, and memory. Moreover, the accuracy of the simulation remains high under both software configuration and hardware changes, and the simulator reflects the expected scaling trend with low cost overhead and fast simulation speed. The simulator has been validated against various software and hardware configurations using the well-known TPC-DS benchmark [15], and the simulation results are valid and as expected. A use case shows how the simulator can be used to solve performance and deployment issues.



