Technical Perspective: Optimized Wandering for Online Aggregation

SIGMOD Rec. Pub Date: 2017-05-12. DOI: 10.1145/3093754.3093762

J. Naughton


Abstract

There is a rich history in the DBMS research literature involving sampling to estimate the results of queries faster than they can be computed exactly. A particularly interesting example of this is “Online Aggregation” proposed by Hellerstein et al. in 1997 [2]. There the idea is to combine sampling with a creative and intuitive user interface. Briefly, when a query starts to run, Online Aggregation will quickly present an estimate of the result of the query (based on data sampled up to that point) and will also present a confidence interval around the estimate. As query execution continues, the estimate is refined, and the confidence interval shrinks. Hidden in this attractive idea, however, are some difficult challenges. As an example, for queries that involve joins, the sampling process is in general slow, especially if most of the tuples from one relation participating in the join “match” with only a few tuples in the other relation. For 20 years the state-of-the-art approach to this problem has been the “Ripple Join” [1]. The following paper by Li, Wu, Yi, and Zhao presents a highly effective alternative. The main idea behind the wander join is to use the presence of indexes to speed the sampling, effectively making a random walk through the data join graph. The details of doing this efficiently (both computationally and statistically) are not obvious. The authors of this paper use a clever combination of sampling strategies from the statistical literature and an on-line optimization process to order the paths chosen for the random walk, in the process achieving much better computational and statistical properties than the previous state-of-the-art algorithm. The authors convincingly demonstrate this through experimentation with an open-source implementation in the Postgres database management system.
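
The two ideas in the abstract, an estimate whose confidence interval shrinks as sampling proceeds and index-driven random walks over the join, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`build_index`, `random_walk_sum`), the two-table setting, the in-memory hash index, the inverse-probability weighting of each walk, and the normal-approximation 95% interval are all assumptions made for illustration.

```python
import math
import random
from collections import defaultdict


def build_index(rows, key):
    """Hypothetical in-memory index: join-key value -> rows with that value."""
    index = defaultdict(list)
    for row in rows:
        index[row[key]].append(row)
    return index


def random_walk_sum(left, right_index, key, value, walks, seed=0):
    """Estimate SUM(value) over the join of `left` with the indexed table.

    Each walk picks a uniform row from `left`, follows the index to a
    uniform matching row, and weights the result by the inverse of the
    probability of that path (an assumed, textbook-style estimator).
    Yields (walks_done, estimate, ci_half_width) after every walk.
    """
    rng = random.Random(seed)
    n, mean, m2 = 0, 0.0, 0.0  # Welford running mean and squared deviations
    for _ in range(walks):
        l_row = rng.choice(left)                    # probability 1/len(left)
        matches = right_index.get(l_row[key], [])   # index lookup, no scan
        if matches:
            r_row = rng.choice(matches)             # probability 1/len(matches)
            x = r_row[value] * len(left) * len(matches)
        else:
            x = 0.0                                 # a failed walk counts as zero
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
        if n > 1:
            half = 1.96 * math.sqrt(m2 / (n - 1) / n)  # approximate 95% interval
        else:
            half = float("inf")
        yield n, mean, half


if __name__ == "__main__":
    left = [{"k": i % 5} for i in range(50)]
    right = [{"k": j % 10, "v": float(j)} for j in range(40)]
    index = build_index(right, "k")
    for n, est, half in random_walk_sum(left, index, "k", "v", walks=20000):
        if n in (100, 1000, 20000):
            print(f"after {n} walks: {est:.1f} +/- {half:.1f}")
```

Printing the running estimate at a few checkpoints shows the behaviour the abstract describes: the interval narrows as more walks complete. The sketch also hints at the difficulty mentioned above, since walks that find no match contribute zero and inflate the variance.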

