A Case Study of Optimizing Big Data Analytical Stacks Using Structured Data Shuffling

Dixin Tang, Taoying Liu, Rubao Lee, Hong Liu, Wei Li

DOI: 10.1109/CLUSTER.2015.19
{"title":"基于结构化数据洗牌的大数据分析栈优化案例研究","authors":"Dixin Tang, Taoying Liu, Rubao Lee, Hong Liu, Wei Li","doi":"10.1109/CLUSTER.2015.19","DOIUrl":null,"url":null,"abstract":"Current major big data analytical stacks often consist of a general-purpose, multi-staged cluster computation framework (e.g. Hadoop) and a SQL query execution system (e.g. Hive) on its top. In such stacks, a key factor of query execution performance is the efficiency of data shuffling between two execution stages (e.g. Map/Reduce). However, current stacks often execute data shuffling in a data-oblivious way, which means that for structured data processing, various useful information about the shuffled data and the queries on the data is simply wasted. Specifically, this problem makes two optimization opportunities lost: i) unnecessary records cannot be filtered in advance, ii) column-oriented compression algorithms cannot be applied. To solve the problem, in this paper, we have designed and implemented a novel data shuffling mechanism in Hadoop, called Structured Data Shuffling (S-Shuffle), which avoids the low efficiencies of traditional data shuffling by carefully leveraging the rich information in data and queries provided by Hive. Our experimental results with industry-standard TPC-H benchmark show that by using S-Shuffle, the performance of SQL query processing on Hadoop can be improved by up to 2.4x..","PeriodicalId":407471,"journal":{"name":"2016 IEEE International Congress on Big Data (BigData Congress)","volume":"45 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2015-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":"{\"title\":\"A Case Study of Optimizing Big Data Analytical Stacks Using Structured Data Shuffling\",\"authors\":\"Dixin Tang, Taoying Liu, Rubao Lee, Hong Liu, Wei Li\",\"doi\":\"10.1109/CLUSTER.2015.19\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Current major big data analytical stacks often consist of a general-purpose, multi-staged cluster computation framework (e.g. Hadoop) and a SQL query execution system (e.g. Hive) on its top. In such stacks, a key factor of query execution performance is the efficiency of data shuffling between two execution stages (e.g. Map/Reduce). However, current stacks often execute data shuffling in a data-oblivious way, which means that for structured data processing, various useful information about the shuffled data and the queries on the data is simply wasted. Specifically, this problem makes two optimization opportunities lost: i) unnecessary records cannot be filtered in advance, ii) column-oriented compression algorithms cannot be applied. To solve the problem, in this paper, we have designed and implemented a novel data shuffling mechanism in Hadoop, called Structured Data Shuffling (S-Shuffle), which avoids the low efficiencies of traditional data shuffling by carefully leveraging the rich information in data and queries provided by Hive. 
Our experimental results with industry-standard TPC-H benchmark show that by using S-Shuffle, the performance of SQL query processing on Hadoop can be improved by up to 2.4x..\",\"PeriodicalId\":407471,\"journal\":{\"name\":\"2016 IEEE International Congress on Big Data (BigData Congress)\",\"volume\":\"45 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2015-09-08\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"3\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2016 IEEE International Congress on Big Data (BigData Congress)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/CLUSTER.2015.19\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2016 IEEE International Congress on Big Data (BigData Congress)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CLUSTER.2015.19","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
A Case Study of Optimizing Big Data Analytical Stacks Using Structured Data Shuffling
Current major big data analytical stacks often consist of a general-purpose, multi-stage cluster computation framework (e.g., Hadoop) and a SQL query execution system (e.g., Hive) on top of it. In such stacks, a key factor in query execution performance is the efficiency of data shuffling between two execution stages (e.g., Map and Reduce). However, current stacks often perform data shuffling in a data-oblivious way, which means that for structured data processing, useful information about the shuffled data and the queries over that data is simply wasted. Specifically, this problem loses two optimization opportunities: (i) unnecessary records cannot be filtered out in advance, and (ii) column-oriented compression algorithms cannot be applied. To solve this problem, we design and implement a novel data shuffling mechanism in Hadoop, called Structured Data Shuffling (S-Shuffle), which avoids the inefficiencies of traditional data shuffling by carefully leveraging the rich information about data and queries provided by Hive. Our experimental results with the industry-standard TPC-H benchmark show that with S-Shuffle, the performance of SQL query processing on Hadoop can be improved by up to 2.4x.
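To make the two optimizations concrete, below is a minimal, self-contained Java sketch of what a "structured" map-side shuffle writer could look like. This is not the paper's actual S-Shuffle implementation; the ColumnarShuffleSketch class, the Row record, and the pushedDownFilter predicate are hypothetical names introduced for illustration. The sketch assumes the query planner (e.g., Hive) can push a row-level predicate down to the shuffle writer, and it shows (i) dropping non-qualifying records before they are buffered or serialized, and (ii) storing surviving records column by column so each column can be compressed independently.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;
import java.util.zip.Deflater;

/**
 * Hypothetical sketch of a structured shuffle writer. Instead of
 * serializing whole rows opaquely, it (i) drops rows that fail a
 * predicate pushed down from the query plan before they enter the
 * shuffle buffer, and (ii) lays surviving rows out column by column
 * so that each column can be compressed with a suitable codec.
 */
public class ColumnarShuffleSketch {

    /** A toy row: one string key column and one int value column. */
    record Row(String key, int value) {}

    // Column buffers: values of the same column are stored together,
    // which typically compresses far better than interleaved rows.
    private final List<String> keyColumn = new ArrayList<>();
    private final List<Integer> valueColumn = new ArrayList<>();
    private final Predicate<Row> pushedDownFilter;

    ColumnarShuffleSketch(Predicate<Row> pushedDownFilter) {
        this.pushedDownFilter = pushedDownFilter;
    }

    /** Optimization (i): filter unnecessary records before shuffling. */
    void write(Row row) {
        if (!pushedDownFilter.test(row)) {
            return; // never buffered, never serialized, never transferred
        }
        keyColumn.add(row.key());
        valueColumn.add(row.value());
    }

    /** Optimization (ii): compress one column as a contiguous block. */
    byte[] flushKeyColumn() {
        StringBuilder sb = new StringBuilder();
        for (String k : keyColumn) {
            sb.append(k).append('\n');
        }
        return deflate(sb.toString().getBytes(StandardCharsets.UTF_8));
    }

    private static byte[] deflate(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        byte[] out = new byte[input.length + 64];
        int n = deflater.deflate(out);
        deflater.end();
        return java.util.Arrays.copyOf(out, n);
    }

    public static void main(String[] args) {
        // Example: a query with "WHERE value > 100" lets the map side
        // discard non-qualifying rows before the shuffle.
        ColumnarShuffleSketch writer =
            new ColumnarShuffleSketch(r -> r.value() > 100);
        writer.write(new Row("a", 50));  // filtered out early
        writer.write(new Row("b", 200)); // kept
        writer.write(new Row("b", 300)); // kept
        System.out.println("compressed key column bytes: "
            + writer.flushKeyColumn().length);
    }
}
```

In a real Hadoop integration this logic would presumably live in the map-side sort-and-spill path rather than in a standalone class, and a production system would pick a codec per column (e.g., dictionary or run-length encoding for low-cardinality keys) instead of the single Deflater used here for brevity.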