Big Data Technologies on Commodity Workstations: A Basic Setup for Apache Impala

Proceedings of the 19th International Conference on Computer Systems and Technologies. Pub Date: 2018-09-13. DOI: 10.1145/3274005.3274021

Marin Fotache, Valerica Greavu-Serban, Ionut Hrubaru, Alexandru Tica


Citations: 1

Abstract

Big Data technologies introduced the idea of parallel processing on cheaper commodity servers. When dealing with huge amounts of data, instead of migrating to more powerful and costly hardware platforms, or buying cloud resources, it is more affordable to add a number of cheaper servers as nodes for data processing and/or storage. NoSQL data stores, Hadoop ecosystems, and NewSQL platforms have proved viable for Big Data storage and processing. In this paper we are concerned with setting up a platform for big data processing using commodity workstations. Many small and medium-sized companies have limited resources, and their workstations remain unused for more than 12 hours a day. Here Beowulf cluster computing could prove useful. Apache Impala was installed as part of a Hadoop distribution on a 9-node cluster. Three TPC-H database schemas were loaded, at scale factors of 1, 2 and 10 GB. A series of 100 SQL queries was randomly generated and executed for each scale factor. Results were collected and analyzed to determine whether the cluster can provide a decent level of data processing performance.
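The abstract does not say how the 100 queries per scale factor were generated. As a minimal sketch only, assuming a simple template-based generator over the standard TPC-H `lineitem` table, a reproducible workload could be built along these lines. The database names (`tpch_sf1`, `tpch_sf2`, `tpch_sf10`), the query template, and the helper functions are all hypothetical and are not taken from the paper.

```python
import random

# Scale factors named in the abstract. The database names are hypothetical.
SCALE_FACTORS = {1: "tpch_sf1", 2: "tpch_sf2", 10: "tpch_sf10"}

# A small subset of the standard TPC-H lineitem columns.
GROUP_COLUMNS = ["l_returnflag", "l_linestatus", "l_shipmode"]
MEASURES = ["l_quantity", "l_extendedprice", "l_discount"]
AGGREGATES = ["SUM", "AVG", "MIN", "MAX"]


def random_query(rng, database):
    """Build one random aggregate query over lineitem (illustrative template)."""
    group_col = rng.choice(GROUP_COLUMNS)
    agg = rng.choice(AGGREGATES)
    measure = rng.choice(MEASURES)
    max_quantity = rng.randint(1, 50)  # TPC-H l_quantity lies between 1 and 50
    return (
        f"SELECT {group_col}, {agg}({measure}) AS result "
        f"FROM {database}.lineitem "
        f"WHERE l_quantity <= {max_quantity} "
        f"GROUP BY {group_col}"
    )


def build_workload(n_queries=100, seed=42):
    """Return {scale_factor: [query, ...]} with n_queries queries per scale factor."""
    rng = random.Random(seed)  # fixed seed so the workload can be regenerated
    return {
        sf: [random_query(rng, db) for _ in range(n_queries)]
        for sf, db in SCALE_FACTORS.items()
    }


if __name__ == "__main__":
    workload = build_workload()
    for sf, queries in workload.items():
        print(sf, len(queries), queries[0])
```

Seeding the generator keeps the workload identical across runs, which matters when execution times at different scale factors are to be compared. Submitting the queries to Impala and timing them is left out here, since the abstract gives no detail on the client or tooling used.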



