Optimal Joins Using Compressed Quadtrees

ACM Transactions on Database Systems (TODS). Pub Date: 2022-02-23. DOI: 10.1145/3514231

Diego Arroyuelo, G. Navarro, Juan L. Reutter, J. Rojas-Ledesma


Citations: 6

Abstract

Worst-case optimal join algorithms have gained a lot of attention in the database literature. We now count several algorithms that are optimal in the worst case, and many of them have been implemented and validated in practice. However, the implementation of these algorithms often requires an enhanced indexing structure: to achieve optimality one either needs to build completely new indexes or must populate the database with several instantiations of indexes such as B+-trees. Either way, this means spending an extra amount of storage space that is typically one or two orders of magnitude more than what is required to store the raw data. We show that worst-case optimal algorithms can be obtained directly from a representation that regards the relations as point sets in variable-dimensional grids, without the need for any significant extra storage. Our representation is a compressed quadtree for the static indexes and a quadtree built on the fly that shares subtrees (which we dub a qdag) for intermediate results. We develop a compositional algorithm to process full join queries under this representation, which simulates navigation of the quadtree of the output, and show that the running time of this algorithm is worst-case optimal in data complexity. We implement our index and compare it experimentally with state-of-the-art alternatives. Our experiments show that our index uses even less space than what is needed to store the data in raw form (and replaces it) and one or two orders of magnitude less space than the other indexes. At the same time, our query algorithm is competitive in time, even sharply outperforming other indexes in various cases. Finally, we extend our framework to evaluate more expressive queries from relational algebra, including not only joins and intersections but also unions and negations.
To obtain optimality on those more complex formulas, we introduce a lazy version of qdags that we dub lqdags, which allows us to navigate over the quadtree representing the output of a formula while only evaluating what is needed from its components. We show that the running time of our query algorithms on this extended set of operations is worst-case optimal under some constraints. Moving to full relational algebra, we also show that lqdags can handle selections and projections. While worst-case optimality is no longer guaranteed, we introduce a partial materialization scheme that extends results from Deep and Koutris regarding compressed representation of query results.
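
The idea of treating a relation as a point set in a grid and answering a query by navigating quadtrees can be illustrated with a small sketch. The code below is not the paper's compressed quadtree or its qdag algorithm; it is a plain pointer-free quadtree over a two-dimensional grid, and it only intersects two relations with the same attributes by descending both trees together and pruning any quadrant that is empty in either one. The names `build`, `intersect`, and `points_of` are hypothetical, and the grid side is assumed to be a power of two.

```python
from typing import Set, Tuple

Point = Tuple[int, int]

# A node is None (empty quadrant), True (one occupied cell), or a 4-tuple of
# child nodes ordered by offset (dx, dy): (0,0), (half,0), (0,half), (half,half).


def build(points: Set[Point], size: int, x0: int = 0, y0: int = 0):
    """Build a quadtree for the size x size subgrid whose corner is (x0, y0)."""
    if not points:
        return None
    if size == 1:
        return True
    half = size // 2
    children = []
    for dy in (0, half):
        for dx in (0, half):
            # Keep only the points that fall inside this quadrant.
            sub = {
                (x, y)
                for (x, y) in points
                if x0 + dx <= x < x0 + dx + half and y0 + dy <= y < y0 + dy + half
            }
            children.append(build(sub, half, x0 + dx, y0 + dy))
    return tuple(children)


def intersect(a, b):
    """Intersect two quadtrees built over the same grid size."""
    if a is None or b is None:
        return None  # an empty quadrant on either side prunes the whole subtree
    if a is True and b is True:
        return True  # both trees contain this cell
    children = tuple(intersect(ca, cb) for ca, cb in zip(a, b))
    # Collapse to None when no child survived, so emptiness stays visible higher up.
    return children if any(c is not None for c in children) else None


def points_of(node, size: int, x0: int = 0, y0: int = 0) -> Set[Point]:
    """Recover the set of points stored in a quadtree."""
    if node is None:
        return set()
    if size == 1:
        return {(x0, y0)}
    half = size // 2
    offsets = [(0, 0), (half, 0), (0, half), (half, half)]  # same order as build
    result: Set[Point] = set()
    for child, (dx, dy) in zip(node, offsets):
        result |= points_of(child, half, x0 + dx, y0 + dy)
    return result


# Two toy binary relations over the attributes (x, y) on a 4 x 4 grid.
R = {(0, 1), (2, 3), (3, 3)}
S = {(0, 0), (2, 3), (3, 3)}
common = points_of(intersect(build(R, 4), build(S, 4)), 4)
```

In this toy case `common` equals the set intersection of `R` and `S`. A join over relations with different attribute sets would first require lifting each relation to the full set of attributes, a step that this sketch leaves out.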

