A Guide to Formal Analysis of Join Processing in Massively Parallel Systems

SIGMOD Rec. Pub Date: 2017-05-11. DOI: 10.1145/3092931.3092934

Paraschos Koutris, Dan Suciu


Citations: 9

Abstract

Over the last decade, there has been an enormous increase in the volume of data that is stored, processed, and analyzed. To improve the performance of query processing on data at this scale, many modern data management systems (e.g. Spark [23, 28], Hadoop [13, 9, 24], and others [19, 14]) rely on parallelism to speed up computation. Parallelism distributes the computation for data-intensive tasks across hundreds, or even thousands, of machines, and thus significantly reduces the completion time of several crucial data processing tasks. In this paper, we present a survey of recent results [18, 4, 5, 17] that study the computational complexity of multiway join processing in such massively parallel systems. Our goal is twofold. First, we introduce a simple theoretical model, called the MPC (Massively Parallel Computation) model, that allows us to rigorously analyze the computational complexity of various parallel algorithms for query processing. Second, using the MPC model as a theoretical tool, we show how to design novel algorithms and techniques for multiway join processing, and how to prove their optimality through tight lower bounds. Our analysis provides a deeper understanding of how much synchronization, communication, and data load is required to compute a multiway join query, and indicates what can be achieved under specific system constraints.
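The abstract names synchronization, communication, and data load as the quantities of interest but does not define them here. As a rough illustration only, the sketch below simulates a single communication round of a two-way join on p servers and reports the largest number of tuples any one server receives. Everything in it is an assumption made for illustration: the function name `hash_join_one_round`, the schemas R(a, b) and S(b, c), and the choice to partition by hashing the join attribute. It is not taken from the paper and is not one of the multiway join algorithms the survey covers.

```python
from collections import defaultdict

def hash_join_one_round(R, S, p):
    """Simulate a one-round parallel hash join of R(a, b) and S(b, c) on p servers.

    Hypothetical helper for illustration. Returns (join_results, max_load),
    where max_load is the largest number of tuples any simulated server receives.
    """
    # Communication step: route every tuple to the server chosen by hashing b.
    r_parts = defaultdict(list)
    s_parts = defaultdict(list)
    for a, b in R:
        r_parts[hash(b) % p].append((a, b))
    for b, c in S:
        s_parts[hash(b) % p].append((b, c))

    # Local computation step: each server joins only the tuples it received.
    results = []
    max_load = 0
    for server in range(p):
        r_local, s_local = r_parts[server], s_parts[server]
        max_load = max(max_load, len(r_local) + len(s_local))
        index = defaultdict(list)
        for b, c in s_local:
            index[b].append(c)
        for a, b in r_local:
            for c in index[b]:
                results.append((a, b, c))
    return sorted(results), max_load

# Toy input (made up): 16 tuples in R, 8 in S, join values spread evenly.
R = [(i, i % 8) for i in range(16)]
S = [(b, 100 + b) for b in range(8)]
results, max_load = hash_join_one_round(R, S, p=4)
print(len(results), max_load)
```

On this evenly spread input the tuples divide equally among the servers. If most tuples in R shared one value of b, a single server would receive nearly all of them, which is the kind of effect a load analysis has to account for.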

