lpData: A Data Placement for High-Throughput and Low-Latency

Huiying Zhang, Weixiang Zhang, Bo Wei, Qianran Si
Published in: 2022 International Conference on Intelligent Computing and Machine Learning (2ICML)
DOI: 10.1109/2ICML58251.2022.00012 (https://doi.org/10.1109/2ICML58251.2022.00012)
Publication date: 2023-04-01

Abstract

In the era of data explosion, many storage technologies have emerged for processing and analyzing big data. Structured storage such as Parquet offers high throughput for sequential reads, while semi-structured storage such as HBase supports low-latency random access. However, because of the gap between the two kinds of storage, neither suits the other's application scenarios. Motivated by applications that need both access patterns, this work proposes a new data placement, lpData. Inheriting an efficient record-level index from Lucene and a high-throughput file format from Parquet, lpData speeds up queries with predicates while preserving sequential-read performance. According to the experimental results of this study, a) lpData shows high performance in both sequential read and random access; b) compared to Parquet, lpData executes 42% faster on average on TPC-H selective queries; c) compared to Lucene, lpData is on average 60% faster on low-selectivity queries and 36× faster on high-selectivity queries.
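The idea of pairing a record-level index with a columnar layout can be illustrated with a toy sketch. This is not the authors' implementation and uses neither Lucene nor Parquet; the class `ToyIndexedColumnStore` and its methods are hypothetical names chosen for illustration. It only shows why such a combination can serve both access patterns: column values are stored contiguously for sequential scans, and an inverted index maps each value to the row ids that hold it, so a predicate query touches only the matching rows.

```python
from collections import defaultdict


class ToyIndexedColumnStore:
    """Hypothetical toy: column-wise storage plus a record-level inverted index."""

    def __init__(self, column_names):
        # One contiguous list per column (stands in for a columnar file format).
        self.columns = {name: [] for name in column_names}
        # index[column][value] -> list of row ids (stands in for a record-level index).
        self.index = {name: defaultdict(list) for name in column_names}
        self.num_rows = 0

    def append(self, record):
        """Append one record (a dict) and index every column value by row id."""
        row_id = self.num_rows
        for name, values in self.columns.items():
            value = record[name]
            values.append(value)
            self.index[name][value].append(row_id)
        self.num_rows += 1

    def scan(self, column):
        """Sequential read: iterate over a single column in storage order."""
        return iter(self.columns[column])

    def lookup(self, column, value, project):
        """Predicate query (column == value): fetch only the matching rows."""
        row_ids = self.index[column].get(value, [])
        return [tuple(self.columns[c][r] for c in project) for r in row_ids]


store = ToyIndexedColumnStore(["order_id", "status", "price"])
store.append({"order_id": 1, "status": "O", "price": 10.0})
store.append({"order_id": 2, "status": "F", "price": 20.0})
store.append({"order_id": 3, "status": "F", "price": 5.5})

total_price = sum(store.scan("price"))                        # sequential read
filled = store.lookup("status", "F", ["order_id", "price"])   # indexed access
```

In this sketch the sequential scan never consults the index, and the predicate lookup never scans the full column. That is the trade-off the abstract describes, here at the cost of the extra memory the index occupies.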