Locality-aware Partitioning in Parallel Database Systems

Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. Pub Date: 2015-05-27. DOI: 10.1145/2723372.2723718

Erfan Zamanian, Carsten Binnig, Abdallah Salama


Citations: 66

Abstract

Parallel database systems horizontally partition large amounts of structured data to provide parallel data processing capabilities for analytical workloads in shared-nothing clusters. One major challenge when horizontally partitioning large amounts of data is to reduce the network costs for a given workload and database schema. A common technique for reducing network costs in parallel database systems is to co-partition tables on their join key in order to avoid expensive remote join operations. However, existing partitioning schemes are limited in this respect: in complex schemata, only subsets of tables that share the same join key can be co-partitioned unless tables are fully replicated. In this paper we present a novel partitioning scheme called predicate-based reference partition (or PREF for short) that allows sets of tables to be co-partitioned based on given join predicates. Moreover, based on PREF, we present two automatic partitioning design algorithms to maximize data locality. One algorithm needs only the schema and data, whereas the other additionally takes the workload as input. In our experiments we show that our automated design algorithms can partition database schemata of different complexity and thus help to effectively reduce the runtime of queries under a given workload when compared to existing partitioning approaches.
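
The idea of co-partitioning by join predicate rather than by a shared join key can be illustrated with a small sketch. The Python below is not the authors' implementation; it is a minimal toy, assuming in-memory tables represented as lists of dicts. The function names (`hash_partition`, `pref_partition`), the column names, and the round-robin placement of tuples without a join partner are all assumptions made for illustration.

```python
def hash_partition(rows, key, n):
    """Partition rows (dicts) into n partitions by integer key modulo n."""
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[row[key] % n].append(row)
    return parts


def pref_partition(rows, referenced_parts, predicate):
    """Place each row in every partition that holds a join partner.

    A row with partners in several partitions is duplicated; a row with
    no partner is assigned round-robin (an assumption of this sketch).
    """
    n = len(referenced_parts)
    parts = [[] for _ in range(n)]
    next_orphan = 0
    for row in rows:
        placed = False
        for i, ref_part in enumerate(referenced_parts):
            if any(predicate(row, ref) for ref in ref_part):
                parts[i].append(row)
                placed = True
        if not placed:
            parts[next_orphan % n].append(row)
            next_orphan += 1
    return parts


# Hypothetical toy data: orders are hash-partitioned on o_orderkey,
# which is not the key used to join them with customers.
orders = [
    {"o_orderkey": 10, "o_custkey": 1},
    {"o_orderkey": 11, "o_custkey": 1},
    {"o_orderkey": 12, "o_custkey": 2},
]
customers = [{"c_custkey": 1}, {"c_custkey": 2}, {"c_custkey": 3}]

order_parts = hash_partition(orders, "o_orderkey", 2)

# Customers follow the orders they join with, via the join predicate.
customer_parts = pref_partition(
    customers,
    order_parts,
    lambda c, o: c["c_custkey"] == o["o_custkey"],
)

# Every order now finds its customer in the same partition.
join_is_local = all(
    any(c["c_custkey"] == o["o_custkey"] for c in customer_parts[i])
    for i, part in enumerate(order_parts)
    for o in part
)
```

In this toy run, customer 1 has orders in both partitions and is therefore stored twice, which is the price paid for a local join: the scheme trades some data redundancy for locality, without fully replicating the customer table.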

