Using Compression Tables to Improve HiveQL Performance with Spark: A Case Study on NVMe Storage Devices

Youppadee Intasorn, Kritwara Rattanaopas, Yanapat Chuchuen
Published in: 2022 26th International Computer Science and Engineering Conference (ICSEC), 2022-12-21
DOI: 10.1109/ICSEC56337.2022.10049309 (https://doi.org/10.1109/ICSEC56337.2022.10049309)

Abstract

Query languages are widely used to process big data, and SQL is the dominant standard. The big data ecosystem offers many SQL-like tools, for example Spark SQL, Hive, Drill, and Presto. This paper focuses on Hive running on the Spark engine. To improve Hive's query performance in a case study on NVMe solid-state devices, we propose storing tables as compressed Parquet files using the Snappy, gzip, and Zstandard (zstd) codecs. The query workload is the TPC-H benchmark. Compression reduces the size of the main transaction table of the TPC-H benchmark by 56%. In addition, Hive on Spark with the proposed Parquet compression codecs shows lower CPU usage than text files in some TPC-H queries. We conclude that NVMe storage with compressed Parquet files is more efficient than text files for improving query performance on the Spark engine.
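The setup described in the abstract, text-format tables rewritten as compressed Parquet and queried through Hive on Spark, might be scripted roughly as follows. This is a minimal sketch and not the authors' procedure: the table name `lineitem_text`, the helper functions, and the target naming scheme are hypothetical. It relies on the Hive setting `hive.execution.engine` and the `parquet.compression` table property; whether `ZSTD` is accepted depends on the Hive and Parquet versions in use, so treat that value as an assumption.

```python
# Sketch: generate HiveQL that copies a text-format table into
# compressed Parquet tables, one per codec.
# Table names and helpers are hypothetical, not taken from the paper.

# Codecs named in the abstract. ZSTD support depends on the Hive/Parquet version.
CODECS = ("SNAPPY", "GZIP", "ZSTD")


def parquet_ctas(source_table: str, codec: str) -> str:
    """Return a CREATE TABLE AS SELECT statement for one compression codec."""
    codec = codec.upper()
    if codec not in CODECS:
        raise ValueError(f"unsupported codec: {codec}")
    target = f"{source_table}_parquet_{codec.lower()}"
    return (
        f"CREATE TABLE {target}\n"
        f"STORED AS PARQUET\n"
        f"TBLPROPERTIES ('parquet.compression'='{codec}')\n"
        f"AS SELECT * FROM {source_table};"
    )


def build_script(source_table: str) -> str:
    """Return a HiveQL script: select the Spark engine, then one CTAS per codec."""
    statements = ["SET hive.execution.engine=spark;"]
    statements += [parquet_ctas(source_table, codec) for codec in CODECS]
    return "\n\n".join(statements)


if __name__ == "__main__":
    print(build_script("lineitem_text"))
```

The generated script could be saved to a file and passed to a Hive client; comparing the on-disk size of each resulting table against the text-format original is one way to check a size reduction such as the 56% reported above.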