Sampling-Based AQP in Modern Analytical Engines

Proceedings of the 18th International Workshop on Data Management on New Hardware. Pub Date: 2022-06-12. DOI: 10.1145/3533737.3535095

Viktor Sanca, A. Ailamaki


Citations: 6

Abstract

As data volumes grow, reducing query execution time remains an elusive goal. While approximate query processing (AQP) techniques offer a principled way to trade accuracy for faster analytical queries, sample creation is often treated as a second-class citizen. Modern analytical engines optimized for high-bandwidth media and multi-core architectures only exacerbate existing inefficiencies, resulting in prohibitively expensive query-time online sampling and longer preprocessing times in offline AQP systems. We demonstrate that sampling operators can be practical in modern scale-up analytical systems. First, we evaluate three common sampling methods, identify algorithmic bottlenecks, and propose hardware-conscious optimizations. Second, we reduce the performance penalties of the added processing and sample materialization through system-aware operator design, and compare the sample creation time to that of the matching relational operators of an in-memory JIT-compiled engine. The cost of data reduction with materialization is up to 2.5x that of the equivalent group-by in the case of stratified sampling, and virtually free (∼1x) for reasonable sample sizes with the other strategies. As query processing starts to dominate the execution time, the gap between online and offline AQP methods diminishes.
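The comparison between stratified sampling and the equivalent group-by can be illustrated with a small sketch. It assumes, and this is not taken from the paper, that each stratum keeps a fixed-size reservoir filled with the classic reservoir algorithm (Algorithm R). The function name `stratified_reservoir_sample` and its parameters are hypothetical. The point is only that the operator does group-by-like work (one hash lookup per row) plus a small amount of per-row sampling logic.

```python
import random
from collections import defaultdict


def stratified_reservoir_sample(rows, key, k, seed=0):
    """Keep at most k uniformly chosen rows per stratum (hypothetical helper).

    rows: iterable of records
    key:  function mapping a record to its stratum (the group-by key)
    k:    reservoir capacity per stratum
    """
    rng = random.Random(seed)
    reservoirs = defaultdict(list)  # stratum -> sampled rows
    seen = defaultdict(int)         # stratum -> rows observed so far

    for row in rows:
        group = key(row)            # same hash lookup a group-by performs
        seen[group] += 1
        reservoir = reservoirs[group]
        if len(reservoir) < k:
            # Reservoir not yet full: always keep the row.
            reservoir.append(row)
        else:
            # Algorithm R: the i-th row replaces a random slot
            # with probability k / i.
            j = rng.randrange(seen[group])
            if j < k:
                reservoir[j] = row

    return dict(reservoirs), dict(seen)


if __name__ == "__main__":
    data = [(i % 3, i) for i in range(300)]
    sample, counts = stratified_reservoir_sample(data, key=lambda r: r[0], k=5)
    for group in sorted(sample):
        print(group, counts[group], sample[group])
```

The per-stratum counts returned alongside the sample are what a downstream estimator would need to scale results back up, since each stratum is sampled at a different rate.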
