Using Compression Tables to Improve HiveQL Performance with Spark: A Case Study on NVMe Storage Devices

2022 26th International Computer Science and Engineering Conference (ICSEC). Pub Date: 2022-12-21. DOI: 10.1109/ICSEC56337.2022.10049309

Youppadee Intasorn, Kritwara Rattanaopas, Yanapat Chuchuen


Abstract

SQL-style query execution is widely used in big data, and the SQL standard remains the major query language. The big data ecosystem offers many SQL-like tools, such as Spark SQL, Hive, Drill, and Presto. This paper focuses on Hive running on the Spark engine. In a case study on NVMe solid-state devices, we propose storing tables as compressed Parquet files, using the Snappy, gzip, and Zstandard (zstd) codecs, to improve Hive's query performance. Query workloads are taken from the TPC-H benchmark. Compression reduces the size of the main transaction table of the TPC-H benchmark by 56%. In addition, Hive on Spark with the proposed Parquet compression codecs shows lower CPU usage than text files for some TPC-H queries. We conclude that, on NVMe storage, compressed Parquet files are more efficient than text files for improving query performance on the Spark engine.
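The abstract does not give the table definitions used in the study. As a rough illustration only, the following Python sketch generates HiveQL that selects the Spark execution engine and copies a text-format table into one Parquet table per codec. The source table name `lineitem_text`, the target naming scheme, and the helper functions are hypothetical. `hive.execution.engine` and the `parquet.compression` table property are standard Hive settings, but whether `ZSTD` is accepted depends on the Hive and Parquet versions installed, which is an assumption here.

```python
# Sketch: build HiveQL for compressed Parquet copies of a text table.
# Table names and helper functions are hypothetical, not from the paper.

CODECS = ("SNAPPY", "GZIP", "ZSTD")  # ZSTD support depends on the Hive/Parquet version


def parquet_ctas(source_table: str, codec: str) -> str:
    """Return a CREATE TABLE ... AS SELECT statement for one codec."""
    codec = codec.upper()
    if codec not in CODECS:
        raise ValueError(f"unsupported codec: {codec}")
    target = f"{source_table}_parquet_{codec.lower()}"
    return (
        f"CREATE TABLE {target}\n"
        f"STORED AS PARQUET\n"
        f"TBLPROPERTIES ('parquet.compression'='{codec}')\n"
        f"AS SELECT * FROM {source_table};"
    )


def session_script(source_table: str) -> str:
    """Return a script that selects the Spark engine, then builds one table per codec."""
    statements = ["SET hive.execution.engine=spark;"]
    statements += [parquet_ctas(source_table, codec) for codec in CODECS]
    return "\n\n".join(statements)


if __name__ == "__main__":
    print(session_script("lineitem_text"))
```

The generated script could be passed to a Hive client to create the tables. Running the same TPC-H queries against the text table and each Parquet table is then one way to compare CPU usage and storage size across codecs.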


