A Case Study of Optimizing Big Data Analytical Stacks Using Structured Data Shuffling
Dixin Tang, Taoying Liu, Rubao Lee, Hong Liu, Wei Li
DOI: 10.1109/CLUSTER.2015.19 | Published: 2015-09-08 | 2016 IEEE International Congress on Big Data (BigData Congress)
Citations: 3
Abstract
Current major big data analytical stacks often consist of a general-purpose, multi-staged cluster computation framework (e.g., Hadoop) with a SQL query execution system (e.g., Hive) on top of it. In such stacks, a key factor in query execution performance is the efficiency of data shuffling between two execution stages (e.g., Map/Reduce). However, current stacks often execute data shuffling in a data-oblivious way, which means that for structured data processing, much useful information about the shuffled data and the queries on that data is simply wasted. Specifically, this problem forfeits two optimization opportunities: i) unnecessary records cannot be filtered out in advance, and ii) column-oriented compression algorithms cannot be applied. To solve this problem, we design and implement a novel data shuffling mechanism in Hadoop, called Structured Data Shuffling (S-Shuffle), which avoids the inefficiencies of traditional data shuffling by carefully leveraging the rich information about data and queries provided by Hive. Our experimental results with the industry-standard TPC-H benchmark show that with S-Shuffle, the performance of SQL query processing on Hadoop can be improved by up to 2.4x.
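The abstract names two concrete optimizations that a structure-aware shuffle enables: dropping records the query can never use before they are written for shuffling, and buffering the surviving records column by column so columnar compression can be applied. The paper's own S-Shuffle implementation is not reproduced here; the following is a minimal, illustrative Java sketch of those two ideas only. The Row class and the addRecord/compressLongColumn names are hypothetical, and a generic Deflater stands in for a real column-specific codec.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;
import java.util.zip.Deflater;

// Illustrative sketch only -- not the paper's S-Shuffle code. It mimics the
// two opportunities the abstract names: (i) predicate filtering before the
// shuffle write, and (ii) a column-oriented buffer so each column can be
// compressed independently. All names here are hypothetical.
public class StructuredShuffleSketch {

    static final class Row {                 // a toy structured record
        final long orderKey;
        final double price;
        Row(long orderKey, double price) { this.orderKey = orderKey; this.price = price; }
    }

    private final Predicate<Row> pushedDownFilter;            // derived from the query plan
    private final List<Long> orderKeys = new ArrayList<>();   // columnar buffers
    private final List<Double> prices = new ArrayList<>();

    StructuredShuffleSketch(Predicate<Row> pushedDownFilter) {
        this.pushedDownFilter = pushedDownFilter;
    }

    // Opportunity (i): records the query will never use are dropped here,
    // before any bytes are written for shuffling.
    void addRecord(Row row) {
        if (!pushedDownFilter.test(row)) return;
        orderKeys.add(row.orderKey);         // opportunity (ii): store by column
        prices.add(row.price);
    }

    // Serialize one column and run a general compressor over it; a real
    // system would pick a column-specific codec (RLE, delta, dictionary).
    static byte[] compressLongColumn(List<Long> column) {
        ByteBuffer buf = ByteBuffer.allocate(column.size() * Long.BYTES);
        for (long v : column) buf.putLong(v);
        return deflate(buf.array());
    }

    private static byte[] deflate(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        while (!deflater.finished()) out.write(chunk, 0, deflater.deflate(chunk));
        deflater.end();
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // The filter mirrors a WHERE clause, e.g. "price > 100".
        StructuredShuffleSketch shuffle = new StructuredShuffleSketch(r -> r.price > 100.0);
        shuffle.addRecord(new Row(1L, 50.0));    // filtered out before shuffle
        shuffle.addRecord(new Row(2L, 250.0));   // kept
        shuffle.addRecord(new Row(3L, 250.0));   // kept
        byte[] shuffled = compressLongColumn(shuffle.orderKeys);
        System.out.println("records kept: " + shuffle.orderKeys.size()
                + ", compressed column bytes: " + shuffled.length);
    }
}
```

The point the sketch illustrates is ordering: the predicate runs before any shuffle bytes exist, so filtered records incur no shuffle I/O at all, and the per-column layout is what makes a dedicated compression pass over each column worthwhile.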