lpData: A Data Placement for High-Throughput and Low-Latency

2022 International Conference on Intelligent Computing and Machine Learning (2ICML). Pub Date: 2023-04-01. DOI: 10.1109/2ICML58251.2022.00012

Huiying Zhang, Weixiang Zhang, Bo Wei, Qianran Si



Abstract

In the era of data explosion, many storage technologies have emerged for processing and analyzing big data. Structured storage such as Parquet offers high throughput for sequential reads, while semi-structured storage such as HBase supports low-latency random access. However, because of the gap between the two kinds of storage, neither is well suited to the other's application scenarios. Motivated by applications that need both access patterns, this work proposes a new data placement, lpData. Inheriting an efficient record-level index from Lucene and a high-throughput file format from Parquet, lpData speeds up queries with predicates while maintaining sequential-read performance. According to the experimental results of this study, a) lpData shows high performance in both sequential read and random access, b) compared to Parquet, lpData executes 42% faster on average in TPC-H selective queries, and c) compared to Lucene, lpData is on average 60% faster for low-selectivity queries and 36× faster for high-selectivity queries.
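The general idea of pairing a record-level index with a columnar layout can be illustrated with a toy sketch. This is not the authors' implementation and uses neither Lucene nor Parquet: the class `ToyColumnStore` and its methods are hypothetical names, the "index" is a plain in-memory dictionary, and the "columns" are Python lists. It only shows why an index helps selective predicate queries while a column-wise layout keeps sequential scans simple.

```python
from collections import defaultdict


class ToyColumnStore:
    """Hypothetical columnar table with an optional inverted index per column."""

    def __init__(self, columns):
        # columns: dict mapping column name -> list of values (equal lengths)
        lengths = {len(values) for values in columns.values()}
        if len(lengths) > 1:
            raise ValueError("all columns must have the same number of rows")
        self.columns = columns
        self.num_rows = lengths.pop() if lengths else 0
        self.indexes = {}

    def build_index(self, column):
        # Record-level inverted index: value -> list of row ids (ascending)
        postings = defaultdict(list)
        for row_id, value in enumerate(self.columns[column]):
            postings[value].append(row_id)
        self.indexes[column] = dict(postings)

    def scan(self, column):
        # Sequential read: iterate over one column from start to end
        return iter(self.columns[column])

    def select(self, column, value, project):
        # Equality predicate: use the index if present, else scan the column
        if column in self.indexes:
            row_ids = self.indexes[column].get(value, [])
        else:
            row_ids = [i for i, v in enumerate(self.columns[column]) if v == value]
        # Fetch only the projected columns for the matching rows
        return [tuple(self.columns[c][i] for c in project) for i in row_ids]


store = ToyColumnStore({
    "id": [1, 2, 3, 4],
    "status": ["open", "closed", "open", "open"],
    "price": [10.0, 20.0, 30.0, 40.0],
})
store.build_index("status")
matches = store.select("status", "closed", ["id", "price"])
total = sum(store.scan("price"))
```

In this sketch, `select` touches only the rows listed in the index for the requested value, whereas `scan` reads a whole column in order. A real system would additionally have to deal with on-disk layout, compression, and index maintenance, none of which is modeled here.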

