Analysis of the Influence of Modeling, Data Format and Processing Tool on the Performance of Hadoop-Hive Based Data Warehouse

J. Inf. Data Manag. Pub Date: 2022-09-21. DOI: 10.5753/jidm.2022.2516

Beatriz Fragnan P. de Oliveira, A. Valente, M. Victorino, E. Ribeiro, M. Holanda


Citations: 1

Abstract

With the emergence of Big Data and the continuous growth of data produced by web applications, smartphones, social networks, and other sources, organizations have begun to invest in alternative solutions for deriving value from this volume of data. In this context, this article evaluates three factors that can significantly influence the performance of Hive queries on Big Data: data modeling, data format, and processing tool. The objective is to present a comparative analysis of Hive performance under the snowflake model and under a fully denormalized model. The influence of two table storage formats (CSV and Parquet) and two data processing tools (Hadoop and Spark) is also compared. The data used for the analysis is the open data of the Brazilian Army, in the Google Cloud environment. The analysis was performed for different data volumes in Hive and for different cluster configuration scenarios. The results show that the Parquet storage format always performed better than the CSV format, regardless of the model and processing tool selected for the test scenario.
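The three factors named in the abstract form a small factorial design, which can be sketched as follows. This is only an illustration under stated assumptions, not the authors' test harness: the function names, the table name, and the column list are hypothetical, and the data volumes and cluster configurations are left out because the abstract does not specify them. The storage clauses are standard HiveQL.

```python
from itertools import product

# Factor levels taken from the abstract.
MODELS = ("snowflake", "denormalized")
FORMATS = ("csv", "parquet")
ENGINES = ("hadoop", "spark")

# Standard HiveQL storage clauses for each file format.
STORAGE_CLAUSE = {
    "csv": "ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE",
    "parquet": "STORED AS PARQUET",
}


def scenarios():
    """Return every model/format/engine combination as a dict."""
    return [
        {"model": m, "format": f, "engine": e}
        for m, f, e in product(MODELS, FORMATS, ENGINES)
    ]


def create_table_sql(table, columns, fmt):
    """Build a Hive CREATE TABLE statement for a hypothetical table.

    `columns` is a list of (name, hive_type) pairs; `fmt` is "csv" or "parquet".
    """
    cols = ", ".join(f"{name} {hive_type}" for name, hive_type in columns)
    return f"CREATE TABLE {table} ({cols}) {STORAGE_CLAUSE[fmt]}"


if __name__ == "__main__":
    for s in scenarios():
        print(s)
    # Hypothetical table and columns, for illustration only.
    print(create_table_sql("fact_example", [("id", "INT"), ("name", "STRING")], "parquet"))
```

Enumerating the combinations explicitly makes it clear that the Parquet-versus-CSV comparison is repeated for each model and each processing tool, which is the basis for the "regardless of the model and processing tool" conclusion.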

