A Sampling-Based Hybrid Approximate Query Processing System in the Cloud

2014 43rd International Conference on Parallel Processing. Pub Date: 2014-10-18. DOI: 10.1109/ICPP.2014.38

Yuxiang Wang, Junzhou Luo, Aibo Song, Fang Dong


Citations: 8

Abstract

Sampling-based approximate query processing lets users of 'Big Data' analytical applications save time and resources: if an estimated result meets the accuracy expectation early, there is no need to wait a long time for the final exact result. Online aggregation (OLA) is an attractive technology of this kind; it answers aggregation queries by computing approximate results whose confidence intervals tighten over time. OLA has been built into MapReduce-based cloud systems for big data analytics, allowing users to monitor query progress and save money by killing the computation once sufficient accuracy has been obtained. Unfortunately, a major obstacle remains: OLA estimation can fail, which degrades OLA performance. Such failures result from a biased sample set, which violates the unbiasedness assumption of OLA sampling. To handle this problem, we first propose a hybrid approximate query processing model that improves overall OLA performance. Its dynamic scheme switching mechanism switches unpromising OLA queries to the bootstrap scheme for further processing, avoiding the full dataset scan that an OLA estimation failure would otherwise cause. We also present a progressive estimation method to reduce the false positive ratio of the dynamic scheme switching mechanism. Finally, we have implemented the hybrid approximate query processing system in Hadoop and conducted extensive experiments on the TPC-H benchmark with skewed data distributions. Our results demonstrate that the hybrid system can produce acceptable approximate results in a time one order of magnitude shorter than the original OLA over Hadoop.
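To make the two estimation schemes named in the abstract concrete, the following is a minimal single-machine sketch, not the authors' Hadoop implementation. It assumes an AVG query over a randomly ordered stream, a normal-approximation confidence interval for the OLA side, and a percentile bootstrap for the bootstrap side. The function names `ola_running_estimates` and `bootstrap_interval`, the batch size, and the synthetic exponential data are all illustrative assumptions. The paper's switching criterion and progressive estimation method are not described in the abstract and are not reproduced here.

```python
import math
import random
import statistics


def ola_running_estimates(values, batch_size, z=1.96):
    """Yield (n_seen, mean, half_width) after every `batch_size` values.

    Assumes `values` arrive in random order, so the prefix seen so far is an
    unbiased sample. Uses Welford's algorithm for a one-pass variance.
    """
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the mean
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
        if n % batch_size == 0 and n > 1:
            std = math.sqrt(m2 / (n - 1))
            # Normal-approximation interval: mean +/- z * std / sqrt(n)
            yield n, mean, z * std / math.sqrt(n)


def bootstrap_interval(sample, n_resamples=200, alpha=0.05, rng=None):
    """Percentile bootstrap confidence interval for the mean of `sample`."""
    rng = rng or random.Random(0)
    k = len(sample)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(sample, k=k)) for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Synthetic skewed data (exponential with true mean 1.0); purely illustrative.
rng = random.Random(42)
data = [rng.expovariate(1.0) for _ in range(20000)]
rng.shuffle(data)  # OLA relies on a random processing order

estimates = list(ola_running_estimates(data, batch_size=2000))
for n_seen, mean, half_width in estimates:
    print(f"n={n_seen:6d}  mean={mean:.4f}  +/- {half_width:.4f}")

# Bootstrap interval computed from a fixed sample instead of a growing prefix.
sample = data[:2000]
lo, hi = bootstrap_interval(sample)
print(f"bootstrap 95% interval on {len(sample)} rows: [{lo:.4f}, {hi:.4f}]")
```

The OLA interval narrows roughly in proportion to 1/sqrt(n) as more rows are processed, which is what lets a user stop a query early. The bootstrap interval is derived from resampling a fixed sample, so it does not need to keep scanning the dataset.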


