在Hadoop集群上构建分布式大数据仓库的新颖物理设计，提高OLAP多维数据集查询性能

IF 2 4区计算机科学 Q2 COMPUTER SCIENCE, THEORY & METHODS

Parallel Computing Pub Date : 2022-07-01 DOI:10.1016/j.parco.2022.102918

Yassine Ramdane , Omar Boussaid , Doulkifli Boukraà , Nadia Kabachi , Fadila Bentayeb

{"title":"在Hadoop集群上构建分布式大数据仓库的新颖物理设计，提高OLAP多维数据集查询性能","authors":"Yassine Ramdane , Omar Boussaid , Doulkifli Boukraà , Nadia Kabachi , Fadila Bentayeb","doi":"10.1016/j.parco.2022.102918","DOIUrl":null,"url":null,"abstract":"<div><p><span>Improving OLAP (Online Analytical Processing) query performance in a distributed system on top of </span>Hadoop<span> is a challenging task. An OLAP Cube query comprises several relational operations, such as selection, join, and group-by aggregation. It is well-known that star join and group-by aggregation are the most costly operations in a Hadoop database system. These operations indeed increase network traffic and may overflow memory; to overcome these difficulties, numerous partitioning and data load balancing techniques have been proposed in the literature. However, some issues remain questionable, such as decreasing the Spark stages and the network I/O for an OLAP query being executed on a distributed system. In a precedent work, we proposed a novel data placement strategy for a big data warehouse over a Hadoop cluster. This data warehouse schema enhances the projection, selection, and star-join operations of an OLAP query, such that the system’s query-optimizer can perform a star join process locally, in only one spark stage without a shuffle phase. Also, the system can skip loading unnecessary data blocks when executing the predicates. In this paper, we extend our previous work with further technical details and experiments, and we propose a new dynamic approach to improve the group-by aggregation. To evaluate our approach, we conduct some experiments on a cluster with 15 nodes. Experimental results show that our method outperforms existing approaches in terms of OLAP query evaluation time.</span></p></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"111 ","pages":"Article 102918"},"PeriodicalIF":2.0000,"publicationDate":"2022-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"5","resultStr":"{\"title\":\"Building a novel physical design of a distributed big data warehouse over a Hadoop cluster to enhance OLAP cube query performance\",\"authors\":\"Yassine Ramdane , Omar Boussaid , Doulkifli Boukraà , Nadia Kabachi , Fadila Bentayeb\",\"doi\":\"10.1016/j.parco.2022.102918\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><p><span>Improving OLAP (Online Analytical Processing) query performance in a distributed system on top of </span>Hadoop<span> is a challenging task. An OLAP Cube query comprises several relational operations, such as selection, join, and group-by aggregation. It is well-known that star join and group-by aggregation are the most costly operations in a Hadoop database system. These operations indeed increase network traffic and may overflow memory; to overcome these difficulties, numerous partitioning and data load balancing techniques have been proposed in the literature. However, some issues remain questionable, such as decreasing the Spark stages and the network I/O for an OLAP query being executed on a distributed system. In a precedent work, we proposed a novel data placement strategy for a big data warehouse over a Hadoop cluster. This data warehouse schema enhances the projection, selection, and star-join operations of an OLAP query, such that the system’s query-optimizer can perform a star join process locally, in only one spark stage without a shuffle phase. Also, the system can skip loading unnecessary data blocks when executing the predicates. In this paper, we extend our previous work with further technical details and experiments, and we propose a new dynamic approach to improve the group-by aggregation. To evaluate our approach, we conduct some experiments on a cluster with 15 nodes. Experimental results show that our method outperforms existing approaches in terms of OLAP query evaluation time.</span></p></div>\",\"PeriodicalId\":54642,\"journal\":{\"name\":\"Parallel Computing\",\"volume\":\"111 \",\"pages\":\"Article 102918\"},\"PeriodicalIF\":2.0000,\"publicationDate\":\"2022-07-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"5\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Parallel Computing\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0167819122000230\",\"RegionNum\":4,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"COMPUTER SCIENCE, THEORY & METHODS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Parallel Computing","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0167819122000230","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, THEORY & METHODS","Score":null,"Total":0}

引用次数: 5

摘要

在Hadoop之上的分布式系统中提高OLAP(在线分析处理)查询性能是一项具有挑战性的任务。OLAP Cube查询包含几个关系操作，例如选择、连接和分组聚合。众所周知，星型连接和分组聚合是Hadoop数据库系统中成本最高的操作。这些操作确实增加了网络流量并可能溢出内存;为了克服这些困难，文献中提出了许多分区和数据负载平衡技术。但是，有些问题仍然存在疑问，例如减少Spark阶段和在分布式系统上执行OLAP查询的网络I/O。在之前的工作中，我们为Hadoop集群上的大数据仓库提出了一种新的数据放置策略。这个数据仓库模式增强了OLAP查询的投影、选择和星型连接操作，这样系统的查询优化器就可以在本地执行星型连接过程，只需要一个火花阶段而不需要shuffle阶段。此外，在执行谓词时，系统可以跳过加载不必要的数据块。在本文中，我们通过进一步的技术细节和实验扩展了之前的工作，并提出了一种新的动态方法来改进分组聚合。为了评估我们的方法，我们在一个有15个节点的集群上进行了一些实验。实验结果表明，该方法在OLAP查询评估时间方面优于现有方法。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Building a novel physical design of a distributed big data warehouse over a Hadoop cluster to enhance OLAP cube query performance

Improving OLAP (Online Analytical Processing) query performance in a distributed system on top of Hadoop is a challenging task. An OLAP Cube query comprises several relational operations, such as selection, join, and group-by aggregation. It is well-known that star join and group-by aggregation are the most costly operations in a Hadoop database system. These operations indeed increase network traffic and may overflow memory; to overcome these difficulties, numerous partitioning and data load balancing techniques have been proposed in the literature. However, some issues remain questionable, such as decreasing the Spark stages and the network I/O for an OLAP query being executed on a distributed system. In a precedent work, we proposed a novel data placement strategy for a big data warehouse over a Hadoop cluster. This data warehouse schema enhances the projection, selection, and star-join operations of an OLAP query, such that the system’s query-optimizer can perform a star join process locally, in only one spark stage without a shuffle phase. Also, the system can skip loading unnecessary data blocks when executing the predicates. In this paper, we extend our previous work with further technical details and experiments, and we propose a new dynamic approach to improve the group-by aggregation. To evaluate our approach, we conduct some experiments on a cluster with 15 nodes. Experimental results show that our method outperforms existing approaches in terms of OLAP query evaluation time.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Parallel Computing 工程技术-计算机：理论方法

CiteScore

3.50

自引率

7.10%

发文量

审稿时长

4.5 months

期刊介绍： Parallel Computing is an international journal presenting the practical use of parallel computer systems, including high performance architecture, system software, programming systems and tools, and applications. Within this context the journal covers all aspects of high-end parallel computing from single homogeneous or heterogenous computing nodes to large-scale multi-node systems. Parallel Computing features original research work and review articles as well as novel or illustrative accounts of application experience with (and techniques for) the use of parallel computers. We also welcome studies reproducing prior publications that either confirm or disprove prior published results. Particular technical areas of interest include, but are not limited to: -System software for parallel computer systems including programming languages (new languages as well as compilation techniques), operating systems (including middleware), and resource management (scheduling and load-balancing). -Enabling software including debuggers, performance tools, and system and numeric libraries. -General hardware (architecture) concepts, new technologies enabling the realization of such new concepts, and details of commercially available systems -Software engineering and productivity as it relates to parallel computing -Applications (including scientific computing, deep learning, machine learning) or tool case studies demonstrating novel ways to achieve parallelism -Performance measurement results on state-of-the-art systems -Approaches to effectively utilize large-scale parallel computing including new algorithms or algorithm analysis with demonstrated relevance to real applications using existing or next generation parallel computer architectures. -Parallel I/O systems both hardware and software -Networking technology for support of high-speed computing demonstrating the impact of high-speed computation on parallel applications