Query optimization strategies for big data

Handbook of Big Data Analytics. Volume 1: Methodologies. Pub Date: 2021-07-07. DOI: 10.1049/pbpc037f_ch4

Nagesh Bhattu Sristy, Prashanth Kadari, Harini Yadamreddy



Abstract

Query optimization for big data architectures such as MapReduce, Spark, and Druid is challenging because of the number of algorithmic issues that must be addressed. Conventional algorithm design concerns such as memory, CPU time, and I/O cost must be analyzed alongside additional parameters such as communication cost. Data-resident skew further complicates the analysis. This chapter studies communication cost reduction strategies for conventional workloads such as joins, spatial queries, and graph queries. We review algorithms for multi-way joins using MapReduce. Multi-way θ-join algorithms address multi-way joins with inequality conditions. Because the output of a θ-join is much larger than that of an equi-join, multi-way θ-joins are harder to analyze. An analysis of the multi-way θ-join is presented in terms of the sizes of the input sets, the sizes of the output sets, and the communication cost. Data-resident skew plays a key role in all the scenarios discussed. We discuss how to address skew in a general sense and present partitioning strategies that minimize the impact of data skew on load imbalance across computing nodes. Applying join strategies to spatial data has attracted the interest of researchers, and distributing a spatial join requires special attention to the spatial nature of the dataset. A controlled-replicate strategy is reviewed for solving the multi-way spatial join problem. Graph-based analytical queries such as triangle counting and subgraph enumeration are presented in the context of distributed processing. Because triangle counting is a primitive needed by many graph queries, it is analyzed from the perspective of the skew it introduces, using an elegant distribution scheme. The subgraph enumeration problem is also presented using various partitioning schemes, with a brief analysis of their performance.
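The abstract does not say which distribution scheme the chapter uses for triangle counting, so the following is only an illustrative sketch of one well-known idea: treat triangle counting as a three-way self-join E(X,Y) ⋈ E(Y,Z) ⋈ E(X,Z) and replicate each edge to a grid of b × b × b reducers. The function names (`bucket`, `map_edge`, `reduce_triangles`, `count_triangles`), the integer node ids, and the modulo hash are assumptions made for this example, and the map and reduce phases are simulated in a single process rather than on a real cluster.

```python
from collections import defaultdict


def bucket(node, b):
    # Assumed hash: integer node ids taken modulo the number of buckets b.
    return node % b


def map_edge(u, v, b):
    """Map phase: replicate one edge (u < v) to every reducer that may need it."""
    hu, hv = bucket(u, b), bucket(v, b)
    for k in range(b):
        yield (hu, hv, k), ("XY", (u, v))  # edge plays the role (X, Y)
        yield (k, hu, hv), ("YZ", (u, v))  # edge plays the role (Y, Z)
        yield (hu, k, hv), ("XZ", (u, v))  # edge plays the role (X, Z)


def reduce_triangles(values):
    """Reduce phase: local three-way join over the edges sent to one reducer."""
    xy = [e for role, e in values if role == "XY"]
    yz = defaultdict(list)
    xz = set()
    for role, (a, c) in values:
        if role == "YZ":
            yz[a].append(c)
        elif role == "XZ":
            xz.add((a, c))
    # A triangle u < v < w is found only at reducer (h(u), h(v), h(w)).
    return sum(1 for u, v in xy for w in yz[v] if (u, w) in xz)


def count_triangles(edges, b=2):
    """Return (triangle count, number of key-value pairs communicated)."""
    # Normalize to u < v, drop self-loops and duplicate edges.
    normalized = {(min(u, v), max(u, v)) for u, v in edges if u != v}
    groups = defaultdict(list)
    for u, v in normalized:
        for key, value in map_edge(u, v, b):
            groups[key].append(value)
    communicated = sum(len(vals) for vals in groups.values())
    triangles = sum(reduce_triangles(vals) for vals in groups.values())
    return triangles, communicated


if __name__ == "__main__":
    # Complete graph on 4 nodes: 6 edges, 4 triangles.
    k4 = [(i, j) for i in range(4) for j in range(i + 1, 4)]
    print(count_triangles(k4, b=2))
```

In this sketch every edge is sent to 3b reducers, so the communication cost grows linearly with b while the work per reducer shrinks. That trade-off between replication and per-node load is the kind of quantity the abstract refers to as communication cost; a skewed degree distribution would still overload the reducers that receive high-degree nodes, which this simple hash does nothing to prevent.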
