Join Optimization in the MapReduce Environment for Column-wise Data Store

2010 Sixth International Conference on Semantics, Knowledge and Grids. Pub Date: 2010-11-01. DOI: 10.1109/SKG.2010.18

Minqi Zhou, Rong Zhang, Dadan Zeng, Weining Qian, Aoying Zhou


Citations: 14

Abstract

Join Optimization in the MapReduce Environment for Column-wise Data Store

Chain join processing, which combines records from two or more tables sequentially, has been well studied in centralized databases. However, it has seldom been discussed in the cloud computing era and remains an open problem, especially where structured (or relational) data are stored column-wise (by attribute) in distributed file systems (e.g., Google File System) spanning hundreds or even thousands of commodity PCs. In this paper, we propose a novel method for chain join processing, one of the common primitives for analyzing column-wise stored data in the cloud era. By selecting only the records (tuples) needed for the chain join, based on information exploited within the bipartite join graph, the communication cost of record transmission can be reduced dramatically. A bushy tree structure is deployed to regulate the chain join sequence, which further reduces the number of intermediate results generated and transmitted and exposes higher parallelism, resulting in more efficient join processing. Our extensive performance study confirms the effectiveness and efficiency of our methods.
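The abstract does not give the algorithm itself, so the following is only an illustrative sketch of the two ideas it names, not the authors' method. It assumes a chain R(a,b), S(b,c), T(c,d), U(d,e) held in memory as lists of dicts in a single process, with no MapReduce involved. The function names (`hash_join`, `prune_dangling`, `bushy_chain_join`) and the sample data are hypothetical. Pruning is done with a standard forward and backward semi-join pass over the join columns, which stands in for the paper's bipartite-join-graph selection. The bushy tree is built by joining adjacent pairs level by level.

```python
from collections import defaultdict


def hash_join(left, right, key):
    """Equi-join two lists of dict rows on the shared attribute `key`."""
    index = defaultdict(list)
    for row in right:
        index[row[key]].append(row)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


def prune_dangling(tables, keys):
    """Drop rows that cannot contribute to the chain join result.

    tables[i] and tables[i + 1] join on keys[i]. Only the join columns
    are inspected, which is what a column-wise store makes cheap to read.
    """
    tables = [list(t) for t in tables]
    # Forward pass: keep rows of tables[i + 1] matched by tables[i].
    for i, key in enumerate(keys):
        live = {row[key] for row in tables[i]}
        tables[i + 1] = [row for row in tables[i + 1] if row[key] in live]
    # Backward pass: keep rows of tables[i] matched by tables[i + 1].
    for i in range(len(keys) - 1, -1, -1):
        live = {row[keys[i]] for row in tables[i + 1]}
        tables[i] = [row for row in tables[i] if row[keys[i]] in live]
    return tables


def bushy_chain_join(tables, keys):
    """Join adjacent pairs level by level, forming a bushy join tree.

    The pairwise joins within one level are independent of each other,
    so a distributed engine could run them in parallel.
    """
    tables, keys = list(tables), list(keys)
    while len(tables) > 1:
        next_tables, next_keys = [], []
        for i in range(0, len(tables) - 1, 2):
            next_tables.append(hash_join(tables[i], tables[i + 1], keys[i]))
            if i + 1 < len(keys):
                next_keys.append(keys[i + 1])  # key linking to the next pair
        if len(tables) % 2 == 1:
            next_tables.append(tables[-1])  # odd table waits for next level
        tables, keys = next_tables, next_keys
    return tables[0]


# Hypothetical sample data; attribute names are unique apart from join keys.
R = [{"a": 1, "b": 10}, {"a": 2, "b": 20}, {"a": 3, "b": 30}]
S = [{"b": 10, "c": 100}, {"b": 20, "c": 200}, {"b": 40, "c": 400}]
T = [{"c": 100, "d": 7}, {"c": 300, "d": 8}]
U = [{"d": 7, "e": "x"}, {"d": 9, "e": "y"}]
keys = ["b", "c", "d"]

pruned = prune_dangling([R, S, T, U], keys)
result = bushy_chain_join(pruned, keys)
print([len(t) for t in pruned])  # rows left per table after pruning
print(result)
```

On this sample, pruning leaves one row per table, and the pruned and unpruned inputs produce the same join result. In a distributed setting the benefit would come from shipping fewer rows between nodes, which this single-process sketch does not measure.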
