A Case Study of Optimizing Big Data Analytical Stacks Using Structured Data Shuffling

Dixin Tang, Taoying Liu, Rubao Lee, Hong Liu, Wei Li

DOI: 10.1109/CLUSTER.2015.19
{"title":"基于结构化数据洗牌的大数据分析栈优化案例研究","authors":"Dixin Tang, Taoying Liu, Rubao Lee, Hong Liu, Wei Li","doi":"10.1109/CLUSTER.2015.19","DOIUrl":null,"url":null,"abstract":"Current major big data analytical stacks often consist of a general-purpose, multi-staged cluster computation framework (e.g. Hadoop) and a SQL query execution system (e.g. Hive) on its top. In such stacks, a key factor of query execution performance is the efficiency of data shuffling between two execution stages (e.g. Map/Reduce). However, current stacks often execute data shuffling in a data-oblivious way, which means that for structured data processing, various useful information about the shuffled data and the queries on the data is simply wasted. Specifically, this problem makes two optimization opportunities lost: i) unnecessary records cannot be filtered in advance, ii) column-oriented compression algorithms cannot be applied. To solve the problem, in this paper, we have designed and implemented a novel data shuffling mechanism in Hadoop, called Structured Data Shuffling (S-Shuffle), which avoids the low efficiencies of traditional data shuffling by carefully leveraging the rich information in data and queries provided by Hive. Our experimental results with industry-standard TPC-H benchmark show that by using S-Shuffle, the performance of SQL query processing on Hadoop can be improved by up to 2.4x..","PeriodicalId":407471,"journal":{"name":"2016 IEEE International Congress on Big Data (BigData Congress)","volume":"45 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2015-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":"{\"title\":\"A Case Study of Optimizing Big Data Analytical Stacks Using Structured Data Shuffling\",\"authors\":\"Dixin Tang, Taoying Liu, Rubao Lee, Hong Liu, Wei Li\",\"doi\":\"10.1109/CLUSTER.2015.19\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Current major big data analytical stacks often consist of a general-purpose, multi-staged cluster computation framework (e.g. Hadoop) and a SQL query execution system (e.g. Hive) on its top. In such stacks, a key factor of query execution performance is the efficiency of data shuffling between two execution stages (e.g. Map/Reduce). However, current stacks often execute data shuffling in a data-oblivious way, which means that for structured data processing, various useful information about the shuffled data and the queries on the data is simply wasted. Specifically, this problem makes two optimization opportunities lost: i) unnecessary records cannot be filtered in advance, ii) column-oriented compression algorithms cannot be applied. To solve the problem, in this paper, we have designed and implemented a novel data shuffling mechanism in Hadoop, called Structured Data Shuffling (S-Shuffle), which avoids the low efficiencies of traditional data shuffling by carefully leveraging the rich information in data and queries provided by Hive. 
Our experimental results with industry-standard TPC-H benchmark show that by using S-Shuffle, the performance of SQL query processing on Hadoop can be improved by up to 2.4x..\",\"PeriodicalId\":407471,\"journal\":{\"name\":\"2016 IEEE International Congress on Big Data (BigData Congress)\",\"volume\":\"45 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2015-09-08\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"3\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2016 IEEE International Congress on Big Data (BigData Congress)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/CLUSTER.2015.19\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2016 IEEE International Congress on Big Data (BigData Congress)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CLUSTER.2015.19","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
A Case Study of Optimizing Big Data Analytical Stacks Using Structured Data Shuffling
Current major big data analytical stacks often consist of a general-purpose, multi-stage cluster computation framework (e.g., Hadoop) and a SQL query execution system (e.g., Hive) on top of it. In such stacks, a key factor in query execution performance is the efficiency of data shuffling between two execution stages (e.g., Map and Reduce). However, current stacks often perform data shuffling in a data-oblivious way, which means that for structured data processing, useful information about the shuffled data and the queries over that data is simply wasted. Specifically, this problem loses two optimization opportunities: (i) unnecessary records cannot be filtered out in advance, and (ii) column-oriented compression algorithms cannot be applied. To solve this problem, we design and implement a novel data shuffling mechanism in Hadoop, called Structured Data Shuffling (S-Shuffle), which avoids the inefficiencies of traditional data shuffling by carefully leveraging the rich information about data and queries provided by Hive. Our experimental results with the industry-standard TPC-H benchmark show that with S-Shuffle, the performance of SQL query processing on Hadoop can be improved by up to 2.4x.
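To make the two optimizations concrete, below is a minimal, self-contained Java sketch of what a "structured" map-side shuffle writer could look like. This is not the paper's actual S-Shuffle implementation; the ColumnarShuffleSketch class, the Row record, and the pushedDownFilter predicate are hypothetical names introduced for illustration. The sketch assumes the query planner (e.g., Hive) can push a row-level predicate down to the shuffle writer, and it shows (i) dropping non-qualifying records before they are buffered or serialized, and (ii) storing surviving records column by column so each column can be compressed independently.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;
import java.util.zip.Deflater;

/**
 * Hypothetical sketch of a structured shuffle writer. Instead of
 * serializing whole rows opaquely, it (i) drops rows that fail a
 * predicate pushed down from the query plan before they enter the
 * shuffle buffer, and (ii) lays surviving rows out column by column
 * so that each column can be compressed with a suitable codec.
 */
public class ColumnarShuffleSketch {

    /** A toy row: one string key column and one int value column. */
    record Row(String key, int value) {}

    // Column buffers: values of the same column are stored together,
    // which typically compresses far better than interleaved rows.
    private final List<String> keyColumn = new ArrayList<>();
    private final List<Integer> valueColumn = new ArrayList<>();
    private final Predicate<Row> pushedDownFilter;

    ColumnarShuffleSketch(Predicate<Row> pushedDownFilter) {
        this.pushedDownFilter = pushedDownFilter;
    }

    /** Optimization (i): filter unnecessary records before shuffling. */
    void write(Row row) {
        if (!pushedDownFilter.test(row)) {
            return; // never buffered, never serialized, never transferred
        }
        keyColumn.add(row.key());
        valueColumn.add(row.value());
    }

    /** Optimization (ii): compress one column as a contiguous block. */
    byte[] flushKeyColumn() {
        StringBuilder sb = new StringBuilder();
        for (String k : keyColumn) {
            sb.append(k).append('\n');
        }
        return deflate(sb.toString().getBytes(StandardCharsets.UTF_8));
    }

    private static byte[] deflate(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        byte[] out = new byte[input.length + 64];
        int n = deflater.deflate(out);
        deflater.end();
        return java.util.Arrays.copyOf(out, n);
    }

    public static void main(String[] args) {
        // Example: a query with "WHERE value > 100" lets the map side
        // discard non-qualifying rows before the shuffle.
        ColumnarShuffleSketch writer =
            new ColumnarShuffleSketch(r -> r.value() > 100);
        writer.write(new Row("a", 50));  // filtered out early
        writer.write(new Row("b", 200)); // kept
        writer.write(new Row("b", 300)); // kept
        System.out.println("compressed key column bytes: "
            + writer.flushKeyColumn().length);
    }
}
```

In a real Hadoop integration this logic would presumably live in the map-side sort-and-spill path rather than in a standalone class, and a production system would pick a codec per column (e.g., dictionary or run-length encoding for low-cardinality keys) instead of the single Deflater used here for brevity.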