Selectivity Estimation for Relation-Tree Joins

32nd International Conference on Scientific and Statistical Database Management. Pub Date: 2020-05-09. DOI: 10.1145/3400903.3400921

Chao Zhang, Jiaheng Lu


Citation count: 2

Abstract

Estimating join selectivity is a crucial problem in many aspects of query processing, such as query optimization and query refinement. Selectivity estimation has been studied extensively for relational joins in SQL queries and for structural joins in path-oriented queries. However, now that leading databases support multi-model data management over relational and tree-structured data together, a new problem has arisen: existing estimation techniques mainly work for a single model and do not handle the heterogeneous setting created by cross-model joins. A straightforward combination of existing estimators does not provide satisfactory estimation quality. This paper studies selectivity estimation for cross-model joins over relational and tree-structured data. Our estimator is based on the Kernel Density Estimation (KDE) model, a statistical approach that uses a data sample to approximate a multivariate probability distribution. KDE has been applied successfully in relational databases to estimate the selectivity of range and join queries. In this work, we propose a KDE-based estimation method, the location-value estimation (LVE) model, which considers both value joins and structural joins in relational and tree-structured data. To improve estimation efficiency on large data samples, we further propose the max-min approximation (MMA) and grid-based approximation (GBA) models to approximate the KDE contribution. Extensive experiments on four real and synthetic datasets demonstrate the effectiveness, efficiency, and scalability of our techniques.
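To make the KDE idea in the abstract concrete, the following is a minimal sketch of sample-based range selectivity estimation over a single numeric attribute. It is not the paper's LVE, MMA, or GBA model. The Gaussian kernel, the normal-reference bandwidth rule, and every function name here (`normal_cdf`, `rule_of_thumb_bandwidth`, `kde_range_selectivity`) are assumptions chosen for illustration only.

```python
import math
import statistics


def normal_cdf(z):
    """Standard normal CDF, computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def rule_of_thumb_bandwidth(sample):
    """Normal-reference bandwidth (assumed here; the paper may choose differently)."""
    n = len(sample)
    sigma = statistics.stdev(sample)
    return 1.06 * sigma * n ** (-1.0 / 5.0)


def kde_range_selectivity(sample, low, high, bandwidth=None):
    """Estimate the fraction of tuples with low <= value <= high.

    Each sample point contributes the probability mass that a Gaussian
    kernel centred on it places inside [low, high]; the estimate is the
    average of those per-point contributions.
    """
    if bandwidth is None:
        bandwidth = rule_of_thumb_bandwidth(sample)
    if bandwidth <= 0:
        raise ValueError("bandwidth must be positive")
    total = 0.0
    for x in sample:
        upper = normal_cdf((high - x) / bandwidth)
        lower = normal_cdf((low - x) / bandwidth)
        total += upper - lower
    return total / len(sample)


if __name__ == "__main__":
    # Hypothetical sample drawn from a relation's numeric column.
    sample = [1.0, 2.0, 2.0, 3.0, 5.0, 8.0, 13.0]
    print(kde_range_selectivity(sample, 2.0, 5.0))
```

The loop visits every sample point for each query, so the cost grows linearly with the sample size. That per-point cost is the kind of overhead that approximations of the KDE contribution, such as those named in the abstract, aim to reduce.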

