lpData: A Data Placement for High-Throughput and Low-Latency
Huiying Zhang, Weixiang Zhang, Bo Wei, Qianran Si
2022 International Conference on Intelligent Computing and Machine Learning (2ICML)
Published: 2023-04-01
DOI: 10.1109/2ICML58251.2022.00012 (https://doi.org/10.1109/2ICML58251.2022.00012)
Abstract
In the era of data explosion, many storage technologies have emerged for processing and analyzing big data. Structured storage such as Parquet offers high throughput for sequential reads, while semi-structured storage such as HBase supports low-latency random access. However, because of the gap between the two kinds of storage, neither is well suited to the other's application scenarios. Motivated by applications that need both access patterns, this work proposes a new data placement, lpData. Inheriting an efficient record-level index from Lucene and a high-throughput file format from Parquet, lpData speeds up queries with predicates while maintaining sequential-read performance. According to the experimental results of this study: a) lpData shows high performance in both sequential read and random access; b) compared to Parquet, lpData executes 42% faster on average on TPC-H selective queries; c) compared to Lucene, lpData is on average 60% faster for low-selectivity queries and 36× faster for high-selectivity queries.
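
The general idea of pairing a record-level index with a columnar layout can be illustrated with a small sketch. This is not lpData's actual format or API, which the abstract does not describe: the class `ToyIndexedColumnStore` and its methods are hypothetical, plain Python lists stand in for a Parquet-style column file, and a dictionary of posting lists stands in for a Lucene-style index.

```python
from collections import defaultdict


class ToyIndexedColumnStore:
    """Hypothetical toy: column arrays plus a per-column inverted index."""

    def __init__(self, columns):
        # columns: dict of column name -> list of values, all the same length
        lengths = {len(values) for values in columns.values()}
        if len(lengths) > 1:
            raise ValueError("all columns must have the same length")
        self.columns = columns
        self.num_rows = lengths.pop() if lengths else 0
        # index[column][value] -> ascending list of row ids holding that value
        self.index = {}
        for name, values in columns.items():
            postings = defaultdict(list)
            for row_id, value in enumerate(values):
                postings[value].append(row_id)
            self.index[name] = dict(postings)

    def scan(self, column):
        """Sequential read: return a whole column in storage order."""
        return list(self.columns[column])

    def lookup(self, column, value, project):
        """Equality predicate: find row ids via the index, fetch only those rows."""
        row_ids = self.index[column].get(value, [])
        return [tuple(self.columns[c][r] for c in project) for r in row_ids]


# Assumed sample data, for illustration only
store = ToyIndexedColumnStore({
    "order_id": [1, 2, 3, 4],
    "status": ["open", "shipped", "open", "returned"],
    "total": [10.0, 25.5, 7.25, 99.0],
})
full_totals = store.scan("total")                                   # sequential path
open_orders = store.lookup("status", "open", ["order_id", "total"])  # indexed path
```

In this sketch, `scan` never touches the index, so a full-column read costs the same as in a plain columnar layout, while `lookup` reads only the rows that the index reports as matching the predicate.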