Using Compression Tables to Improve HiveQL Performance with Spark: A Case Study on NVMe Storage Devices
Youppadee Intasorn, Kritwara Rattanaopas, Yanapat Chuchuen
2022 26th International Computer Science and Engineering Conference (ICSEC), published 2022-12-21
DOI: 10.1109/ICSEC56337.2022.10049309
Abstract
SQL is the dominant query language for big data, and many SQL-like tools are available, for example Spark SQL, Hive, Drill, and Presto. This paper focuses on Hive running on the Spark engine. To improve Hive's query performance in a case study on NVMe solid-state devices, we propose storing tables as compressed Parquet files using the Snappy, gzip, and Zstandard (zstd) codecs. The query workload is the TPC-H benchmark. Compression reduces the size of the main transaction table of the TPC-H benchmark by 56%, and Hive on Spark with the proposed Parquet compression codecs shows lower CPU usage than text files in some TPC-H queries. We conclude that, on NVMe storage, compressed Parquet files are more efficient than text files for improving query performance on the Spark engine.
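
The setup described in the abstract, text-format tables rewritten as compressed Parquet tables and queried by Hive on Spark, could be prepared roughly as sketched below. This is an illustrative sketch, not the authors' scripts: the helper `parquet_ctas`, the target table naming scheme, and the choice of `lineitem` as the source table are assumptions. The `parquet.compression` table property and the `hive.execution.engine` setting are standard Hive options, but whether `ZSTD` is accepted depends on the Hive and Parquet versions in use.

```python
# Sketch: build HiveQL that copies a text-format table into compressed Parquet tables.
# Table names and the codec list are illustrative assumptions, not the paper's scripts.

CODECS = ("SNAPPY", "GZIP", "ZSTD")  # ZSTD needs a Hive/Parquet build that supports it


def parquet_ctas(source_table: str, codec: str) -> str:
    """Return a CREATE TABLE AS SELECT statement for one compression codec."""
    codec = codec.upper()
    if codec not in CODECS:
        raise ValueError(f"unsupported codec: {codec}")
    # Hypothetical naming scheme: <source>_parquet_<codec>
    target = f"{source_table}_parquet_{codec.lower()}"
    return (
        f"CREATE TABLE {target}\n"
        f"STORED AS PARQUET\n"
        f"TBLPROPERTIES ('parquet.compression'='{codec}')\n"
        f"AS SELECT * FROM {source_table};"
    )


if __name__ == "__main__":
    # Run Hive queries on the Spark engine, then emit one CTAS per codec.
    print("SET hive.execution.engine=spark;")
    for codec in CODECS:
        print(parquet_ctas("lineitem", codec))
```

Running the script prints statements that can be pasted into a Hive session. Comparing the on-disk size of each resulting table with the text-format original is one way to check a size reduction such as the 56% figure reported above.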