A Guide to Formal Analysis of Join Processing in Massively Parallel Systems

Paraschos Koutris, Dan Suciu
SIGMOD Rec., pp. 18-27. Published 2017-05-11. DOI: https://doi.org/10.1145/3092931.3092934
Citations: 9

Abstract

Over the last decade, the volume of data that is stored, processed and analyzed has increased enormously. To improve query processing performance on data of this size, many modern data management systems (e.g. Spark [23, 28], Hadoop [13, 9, 24], and others [19, 14]) rely on parallelism to speed up computation. Parallelism distributes the computation of data-intensive tasks across hundreds or even thousands of machines, and thus significantly reduces the completion time of several crucial data processing tasks. In this paper, we survey recent results [18, 4, 5, 17] that study the computational complexity of multiway join processing in such massively parallel systems. Our goal is twofold. First, we introduce a simple theoretical model, called the MPC (Massively Parallel Computation) model, that allows us to rigorously analyze the computational complexity of various parallel algorithms for query processing. Second, using the MPC model as a theoretical tool, we show how to design novel algorithms and techniques for multiway join processing, and how to prove their optimality through tight lower bounds. Our analysis provides a deeper understanding of how much synchronization, communication and data load is required to compute a multiway join query, and indicates what can be achieved under specific system constraints.
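To make the quantities named in the abstract concrete, the following is a minimal illustrative sketch, not an algorithm taken from the paper. It simulates a single synchronization round of a hash-partitioned join R(x, y) JOIN S(y, z) on p simulated servers and reports the maximum number of tuples any server receives, which stands in here for the "data load". The function name `one_round_hash_join`, the relation shapes, and the use of Python's built-in `hash` as the partitioning function are all assumptions made for illustration.

```python
from collections import defaultdict

def one_round_hash_join(R, S, p):
    """Simulate one round of a hash join R(x, y) JOIN S(y, z) on p servers.

    Hypothetical helper for illustration: returns (join output, max load),
    where the load of a server is the number of tuples it receives.
    """
    # Communication step: route each tuple to the server chosen by
    # hashing its join value y (assumed partitioning function: hash(y) % p).
    r_parts = defaultdict(list)
    s_parts = defaultdict(list)
    for x, y in R:
        r_parts[hash(y) % p].append((x, y))
    for y, z in S:
        s_parts[hash(y) % p].append((y, z))

    # Load per server: tuples received from both relations in this round.
    loads = [len(r_parts[i]) + len(s_parts[i]) for i in range(p)]

    # Local computation step: each server joins only its own fragments.
    output = []
    for i in range(p):
        index = defaultdict(list)
        for y, z in s_parts[i]:
            index[y].append(z)
        for x, y in r_parts[i]:
            for z in index[y]:
                output.append((x, y, z))
    return output, max(loads)


# Uniform join values spread the 12 input tuples evenly over 4 servers.
R_uniform = [(i, i % 4) for i in range(8)]
S = [(j, j * 10) for j in range(4)]
out_uniform, load_uniform = one_round_hash_join(R_uniform, S, 4)

# A single heavy join value sends all of R to one server.
R_skewed = [(i, 0) for i in range(8)]
out_skewed, load_skewed = one_round_hash_join(R_skewed, S, 4)
```

In this toy setting the uniform input yields a maximum load equal to the input size divided by p, while the skewed input concentrates almost all tuples on one server. This is the kind of trade-off between rounds, communication and load that a formal model lets one analyze.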