Title: Query optimization strategies for big data
Authors: Nagesh Bhattu Sristy, Prashanth Kadari, Harini Yadamreddy
Published in: Handbook of Big Data Analytics. Volume 1: Methodologies
Publication date: 2021-07-07
DOI: 10.1049/pbpc037f_ch4 (https://doi.org/10.1049/pbpc037f_ch4)
Citations: 0
Abstract
Query optimization for big data architectures such as MapReduce, Spark, and Druid is challenging because of the large number of algorithmic issues that must be addressed. Conventional algorithm design concerns such as memory, CPU time, and I/O cost must be analyzed alongside additional parameters such as communication cost. Data-resident skew further complicates the analysis. This chapter studies communication cost reduction strategies for conventional workloads such as joins, spatial queries, and graph queries. We review algorithms for multi-way joins using MapReduce. Multi-way θ-join algorithms handle multi-way joins with inequality conditions. Because the output of a θ-join is much larger than that of an equi-join, multi-way θ-joins are harder to analyze. An analysis of the multi-way θ-join is presented in terms of input sizes, output sizes, and communication cost. Data-resident skew plays a key role in all the scenarios discussed. We discuss how to address skew in a general sense and present partitioning strategies that minimize its impact on load imbalance across computing nodes. Applying join strategies to spatial data has attracted the interest of researchers, and distributing a spatial join requires special attention to the spatial nature of the dataset. A controlled-replicate strategy is reviewed for solving the multi-way spatial join problem. Graph-based analytical queries such as triangle counting and subgraph enumeration are presented in the context of distributed processing. As a primitive needed by many graph queries, triangle counting is analyzed from the perspective of the skew it introduces, using an elegant distribution scheme. The subgraph enumeration problem is also presented, with various partitioning schemes and a brief analysis of their performance.
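Two of the claims above can be illustrated with a small sketch. It is not taken from the chapter: the function names, the modulo partitioner, and the toy data are all assumptions chosen for illustration. The first function shows how a single heavy-hitter key concentrates load on one reducer under key-based partitioning, and the second compares the output size of an equi-join with that of a θ-join on the same inputs.

```python
from collections import Counter

def reducer_loads(keys, num_reducers):
    """Count the tuples each reducer receives under modulo partitioning.

    Modulo on integer keys is an assumed stand-in for a hash partitioner.
    """
    loads = [0] * num_reducers
    for k in keys:
        loads[k % num_reducers] += 1
    return loads

def join_sizes(r_keys, s_keys):
    """Return (equi-join size, theta-join size) for the condition r < s."""
    s_counts = Counter(s_keys)
    equi = sum(s_counts[k] for k in r_keys)
    theta = sum(1 for r in r_keys for s in s_keys if r < s)
    return equi, theta

# Hypothetical data: key 0 is a heavy hitter, keys 1..8 appear once each.
skewed_keys = [0] * 92 + list(range(1, 9))
uniform_keys = list(range(100))

skewed_loads = reducer_loads(skewed_keys, 4)    # one reducer gets almost everything
uniform_loads = reducer_loads(uniform_keys, 4)  # load is spread evenly

# Ten distinct keys on each side: the inequality join matches far more pairs.
equi_size, theta_size = join_sizes(list(range(10)), list(range(10)))
```

With this toy data, the reducer that owns key 0 receives 94 of the 100 tuples, and the θ-join produces 45 output pairs against 10 for the equi-join. Real partitioners and real data distributions will behave differently in detail.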