Technical Perspective: Optimized Wandering for Online Aggregation
J. Naughton
SIGMOD Rec., published 2017-05-12
DOI: 10.1145/3093754.3093762 (https://doi.org/10.1145/3093754.3093762)
Citations: 0
Abstract
There is a rich history in the DBMS research literature of using sampling to estimate the results of queries faster than they can be computed exactly. A particularly interesting example is “Online Aggregation,” proposed by Hellerstein et al. in 1997 [2]. The idea there is to combine sampling with a creative and intuitive user interface. Briefly, when a query starts to run, Online Aggregation quickly presents an estimate of the query result (based on the data sampled up to that point) together with a confidence interval around the estimate. As query execution continues, the estimate is refined and the confidence interval shrinks. Hidden in this attractive idea, however, are some difficult challenges. For example, for queries that involve joins, the sampling process is in general slow, especially if most of the tuples from one relation participating in the join “match” only a few tuples in the other relation. For 20 years the state-of-the-art approach to this problem has been the “Ripple Join” [1]. The following paper by Li, Wu, Yi, and Zhao presents a highly effective alternative, the wander join. The main idea behind the wander join is to use existing indexes to speed up the sampling, effectively making a random walk through the join graph of the data. The details of doing this efficiently (both computationally and statistically) are not obvious. The authors use a clever combination of sampling strategies from the statistical literature and an online optimization process to order the paths chosen for the random walk, in the process achieving much better computational and statistical properties than the previous state-of-the-art algorithm. The authors convincingly demonstrate this through experiments with an open-source implementation in the Postgres database management system.
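To make the random-walk idea concrete, here is a minimal sketch, assuming a two-table join R JOIN S with an in-memory hash index standing in for a database index. The names `build_index` and `wander_sum` are hypothetical and come neither from the paper nor from Postgres. Each walk picks a tuple uniformly from R, follows the index into S, and weights the sampled value by the inverse of the probability of that path (a Horvitz-Thompson-style estimator); the confidence interval uses a normal approximation. The online optimization of walk orders mentioned in the abstract is not modeled.

```python
import math
import random
from collections import defaultdict


def build_index(rows, key):
    """Hash index: join-key value -> list of rows carrying that key."""
    index = defaultdict(list)
    for row in rows:
        index[row[key]].append(row)
    return index


def wander_sum(r_rows, s_index, key, value, walks, rng, z=1.96):
    """Estimate SUM(S.value) over R JOIN S from `walks` random walks.

    Returns (estimate, half_width), where half_width is the half-width of a
    normal-approximation confidence interval for the given z.
    """
    if walks < 1:
        raise ValueError("walks must be at least 1")
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations (Welford's method)
    for _ in range(walks):
        r = rng.choice(r_rows)             # step 1: uniform tuple from R
        matches = s_index.get(r[key], [])  # step 2: index lookup into S
        if matches:
            s = rng.choice(matches)        # uniform among the matching tuples
            # The path was chosen with probability 1 / (|R| * len(matches)),
            # so dividing by that probability keeps the estimate unbiased.
            x = s[value] * len(r_rows) * len(matches)
        else:
            x = 0.0                        # a failed walk contributes zero
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    variance = m2 / (n - 1) if n > 1 else float("inf")
    return mean, z * math.sqrt(variance / n)
```

Calling `wander_sum` with a growing number of walks mimics the Online Aggregation interface: the estimate is refined and the reported half-width tends to shrink roughly in proportion to one over the square root of the number of walks.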