Fast approximate computation of statistics on views

Proceedings of the 2006 ACM SIGMOD international conference on Management of data Pub Date : 2006-06-27 DOI:10.1145/1142473.1142564

C. Zuzarte, Xiaohui Yu

{"title":"Fast approximate computation of statistics on views","authors":"C. Zuzarte, Xiaohui Yu","doi":"10.1145/1142473.1142564","DOIUrl":null,"url":null,"abstract":"Accurate estimation of the sizes of intermediate query results (cardinality estimation) is of critical importance to plan costing in query optimization. The common practice in current commercial database systems such as IBM DB2 Universal Database (DB2 UDB) is to derive the cardinality estimates from base-table statistics. However, this approach often suffers from simplifying yet unrealistic assumptions that have to be made about the underlying data (for example, different attributes are independently distributed).Ways for exploiting statistics on query expressions (or, statistics on views, or SITs) have been proposed to improve the accuracy of cardinality estimation. We propose a novel method for efficient computation of SITs for joins. In particular, we are concerned with statistics on join queries involving large fact tables and relatively small dimension tables. Rather than materializing the views, we make use of the frequency statistics that are available on the fact tables to obtain an approximate estimate of the statistics on various attributes in the join results. The dimension tables are generally much smaller than the fact table, and therefore we can afford to closely examine the dimension table, while at the same time avoid accessing the fact table. By closely examining the dimension table, we are able to capture the correlations between the attributes in the dimension table as well as the skew and domain range of the fact table join column values. This leads to reasonably accurate statistics on the join result. We prototyped this idea as a module on top of DB2 UDB, and our experience shows that employment of this technique results in a very significant speed-up in the computation of SITs, at the expense of only slight degradation in accuracy compared with the full-materialization method.","PeriodicalId":416090,"journal":{"name":"Proceedings of the 2006 ACM SIGMOD international conference on Management of data","volume":"107 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2006-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2006 ACM SIGMOD international conference on Management of data","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/1142473.1142564","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 1

Abstract

Accurate estimation of the sizes of intermediate query results (cardinality estimation) is of critical importance to plan costing in query optimization. The common practice in current commercial database systems such as IBM DB2 Universal Database (DB2 UDB) is to derive the cardinality estimates from base-table statistics. However, this approach often suffers from simplifying yet unrealistic assumptions that have to be made about the underlying data (for example, different attributes are independently distributed).Ways for exploiting statistics on query expressions (or, statistics on views, or SITs) have been proposed to improve the accuracy of cardinality estimation. We propose a novel method for efficient computation of SITs for joins. In particular, we are concerned with statistics on join queries involving large fact tables and relatively small dimension tables. Rather than materializing the views, we make use of the frequency statistics that are available on the fact tables to obtain an approximate estimate of the statistics on various attributes in the join results. The dimension tables are generally much smaller than the fact table, and therefore we can afford to closely examine the dimension table, while at the same time avoid accessing the fact table. By closely examining the dimension table, we are able to capture the correlations between the attributes in the dimension table as well as the skew and domain range of the fact table join column values. This leads to reasonably accurate statistics on the join result. We prototyped this idea as a module on top of DB2 UDB, and our experience shows that employment of this technique results in a very significant speed-up in the computation of SITs, at the expense of only slight degradation in accuracy compared with the full-materialization method.

查看原文本刊更多论文

视图统计的快速近似计算

准确估计中间查询结果的大小(基数估计)对于查询优化中的计划开销至关重要。当前商业数据库系统(如IBM DB2 Universal database (DB2 UDB))中的常见做法是从基表统计信息中获得基数估计值。然而，这种方法常常会因为必须对底层数据(例如，不同的属性是独立分布的)做出简化且不切实际的假设而受到影响。已经提出了利用查询表达式上的统计信息(或视图上的统计信息，或sit)的方法，以提高基数估计的准确性。我们提出了一种高效计算连接sit的新方法。我们特别关注涉及大型事实表和相对较小维度表的连接查询的统计信息。我们没有具体化视图，而是利用事实表上可用的频率统计信息来获得连接结果中各种属性的统计信息的近似估计。维度表通常比事实表小得多，因此我们可以仔细检查维度表，同时避免访问事实表。通过仔细检查维度表，我们能够捕获维度表中属性之间的相关性，以及事实表连接列值的倾斜度和域范围。这将导致对连接结果的相当准确的统计信息。我们将这个想法作为DB2 UDB之上的一个模块进行了原型化，我们的经验表明，使用这种技术可以大大加快sit的计算速度，而与完全实物化方法相比，其代价是精度只有轻微的降低。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

Proceedings of the 2006 ACM SIGMOD international conference on Management of data

自引率

0.00%

发文量