An Experimental Comparison of Thirteen Relational Equi-Joins in Main Memory

Proceedings of the 2016 International Conference on Management of Data. Pub Date: 2016-06-26. DOI: 10.1145/2882903.2882917

Stefan Schuh, Xiao Chen, J. Dittrich


Citations: 91

Abstract

Relational equi-joins are at the heart of almost every query plan. They have been studied, improved, and reexamined on a regular basis since the existence of the database community. In the past four years several new join algorithms have been proposed and experimentally evaluated. Some of those papers contradict each other in their experimental findings. This makes it surprisingly hard to answer a very simple question: what is the fastest join algorithm in 2015? In this paper we will try to develop an answer. We start with an end-to-end black box comparison of the most important methods. Afterwards, we inspect the internals of these algorithms in a white box comparison. We derive improved variants of state-of-the-art join algorithms by applying optimizations like software write-combine buffers, various hash table implementations, as well as NUMA-awareness in terms of data placement and scheduling. We also inspect various radix partitioning strategies. Eventually, we are in the position to perform a comprehensive comparison of thirteen different join algorithms. We factor in scaling effects in terms of size of the input datasets, the number of threads, different page sizes, and data distributions. Furthermore, we analyze the impact of various joins on an (unchanged) TPC-H query. Finally, we conclude with a list of major lessons learned from our study and a guideline for practitioners implementing massive main-memory joins. As is the case with almost all algorithms in databases, we will learn that there is no single best join algorithm. Each algorithm has its strengths and weaknesses and shines in different areas of the parameter space.
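The abstract contrasts hash-based joins with radix-partitioned variants. As a rough illustration only, and not the authors' implementation, the sketch below shows the general idea in Python: a plain (no-partitioning) hash join, and a radix join that first splits both inputs by the low bits of the key so that each partition pair can be joined independently. The function names, the tuple layout `(key, payload)`, and the default of 2 radix bits are assumptions made for this example. Real main-memory joins of the kind studied in the paper add multi-threading, cache-sized partitions, write-combine buffers, and NUMA-aware placement, none of which is modelled here.

```python
from collections import defaultdict


def hash_join(build, probe):
    """Plain hash join: build a table on one input, probe it with the other."""
    table = defaultdict(list)
    for key, payload in build:
        table[key].append(payload)
    # Emit one (key, build_payload, probe_payload) row per matching pair.
    return [(key, b, p) for key, p in probe for b in table.get(key, ())]


def radix_partition(tuples, bits):
    """Split (key, payload) tuples into 2**bits partitions by the key's low bits."""
    mask = (1 << bits) - 1
    partitions = [[] for _ in range(1 << bits)]
    for key, payload in tuples:
        partitions[key & mask].append((key, payload))
    return partitions


def radix_join(r, s, bits=2):
    """Partition both inputs the same way, then hash-join matching partitions."""
    result = []
    for r_part, s_part in zip(radix_partition(r, bits), radix_partition(s, bits)):
        result.extend(hash_join(r_part, s_part))
    return result


if __name__ == "__main__":
    r = [(1, "a"), (2, "b"), (5, "c")]
    s = [(1, "x"), (5, "y"), (5, "z"), (7, "w")]
    # Both strategies return the same set of matches; only the order may differ.
    print(sorted(radix_join(r, s)) == sorted(hash_join(r, s)))
```

Because equal keys always share the same low bits, matching tuples land in the same partition pair, which is why the partitioned join produces the same matches as the plain hash join.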

