Communication Steps for Parallel Query Processing

Journal of the ACM (JACM), Pub Date: 2017-10-14, DOI: 10.1145/3125644

P. Beame, Paraschos Koutris, Dan Suciu


Citations: 46

Abstract

We study the problem of computing conjunctive queries over large databases on parallel architectures without shared storage. Using the structure of such a query q and the skew in the data, we study tradeoffs between the number of processors, the number of rounds of communication, and the per-processor load (the number of bits each processor can send or can receive in a single round) that are required to compute q. Since each processor must store its received bits, the load is at most the number of bits of storage per processor. When the data are free of skew, we obtain essentially tight upper and lower bounds for one-round algorithms, and we show how the bounds degrade when there is skew in the data. In the case of skewed data, we show how to improve the algorithms when approximate degrees of the (necessarily small number of) heavy-hitter elements are available, obtaining essentially optimal algorithms for queries such as skewed simple joins and skewed triangle join queries. For queries that we identify as treelike, we also prove nearly matching upper and lower bounds for multi-round algorithms for a natural class of skew-free databases. One consequence of these latter lower bounds is that for any ϵ > 0, using p processors to compute the connected components of a graph, or to output the path, if any, between a specified pair of vertices of a graph with m edges and per-processor load that is O(m/p^(1−ϵ)) requires Ω(log p) rounds of communication. Our upper bounds are given by simple structured algorithms using MapReduce. Our one-round lower bounds are proved in a very general model, which we call the Massively Parallel Communication (MPC) model, that allows processors to communicate arbitrary bits. Our multi-round lower bounds apply in a restricted version of the MPC model in which processors in subsequent rounds after the first communication round are only allowed to send tuples.
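
To make the notion of per-processor load concrete, the following is a minimal sketch, not taken from the paper, of a one-round hash-partitioned simple join R(x, y) ⋈ S(y, z) simulated on p servers. The function name `one_round_hash_join`, the relation layouts, and the choice to count load in tuples received rather than bits are all assumptions made for illustration. It is a plain hash partition on the join attribute, not the skew-aware algorithms the abstract refers to.

```python
from collections import defaultdict


def one_round_hash_join(R, S, p):
    """Simulate a one-round join of R(x, y) and S(y, z) on p servers.

    Every tuple is routed to server hash(y) % p in a single communication
    round; each server then joins what it received locally.
    Returns (sorted join output, max tuples received by any one server).
    """
    received_R = [[] for _ in range(p)]
    received_S = [[] for _ in range(p)]

    # Communication round: route each tuple by its join attribute y.
    for (x, y) in R:
        received_R[hash(y) % p].append((x, y))
    for (y, z) in S:
        received_S[hash(y) % p].append((y, z))

    # Local computation: an in-memory hash join on each server.
    output = []
    for server in range(p):
        index = defaultdict(list)
        for (y, z) in received_S[server]:
            index[y].append(z)
        for (x, y) in received_R[server]:
            for z in index.get(y, []):
                output.append((x, y, z))

    # Load here is counted in tuples (an illustrative simplification).
    max_load = max(len(received_R[s]) + len(received_S[s]) for s in range(p))
    return sorted(output), max_load


p, m = 4, 1000

# Skew-free input: every y value occurs once in each relation.
R_flat = [(i, i) for i in range(m)]
S_flat = [(i, i + 1) for i in range(m)]
flat_out, flat_load = one_round_hash_join(R_flat, S_flat, p)

# Skewed input: y = 0 is a heavy hitter in R.
R_skew = [(i, 0) for i in range(m)]
S_skew = [(0, 0)] + [(i, i) for i in range(1, m)]
skew_out, skew_load = one_round_hash_join(R_skew, S_skew, p)

print("skew-free max load:", flat_load)
print("skewed max load:", skew_load)
```

With these inputs the busiest server receives 500 tuples in the skew-free run (2m/p) and 1250 tuples in the skewed run, because every R tuple carrying the heavy-hitter value lands on the same server. This is the kind of degradation under skew that the abstract describes.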
