利用平衡覆盖计划优化云数据湖查询

IF 5 2区计算机科学 Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS

IEEE Transactions on Cloud Computing Pub Date : 2023-12-05 DOI:10.1109/TCC.2023.3339208

Grisha Weintraub;Ehud Gudes;Shlomi Dolev;Jeffrey D. Ullman

{"title":"利用平衡覆盖计划优化云数据湖查询","authors":"Grisha Weintraub;Ehud Gudes;Shlomi Dolev;Jeffrey D. Ullman","doi":"10.1109/TCC.2023.3339208","DOIUrl":null,"url":null,"abstract":"Cloud data lakes emerge as an inexpensive solution for storing very large amounts of data. The main idea is the separation of compute and storage layers. Thus, cheap cloud storage is used for storing the data, while compute engines are used for running analytics on this data in “on-demand” mode. However, to perform any computation on the data in this architecture, the data should be moved from the storage layer to the compute layer over the network for each calculation. Obviously, that hurts calculation performance and requires huge network bandwidth. In this paper, we study different approaches to improve query performance in a data lake architecture. We define an optimization problem that can provably speed up data lake queries. We prove that the problem is NP-hard and suggest heuristic approaches. Then, we demonstrate through the experiments that our approach is feasible and efficient (up to ×30 query execution time improvement based on the TPC-H benchmark).","PeriodicalId":13202,"journal":{"name":"IEEE Transactions on Cloud Computing","volume":"12 1","pages":"84-99"},"PeriodicalIF":5.0000,"publicationDate":"2023-12-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Optimizing Cloud Data Lake Queries With a Balanced Coverage Plan\",\"authors\":\"Grisha Weintraub;Ehud Gudes;Shlomi Dolev;Jeffrey D. Ullman\",\"doi\":\"10.1109/TCC.2023.3339208\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Cloud data lakes emerge as an inexpensive solution for storing very large amounts of data. The main idea is the separation of compute and storage layers. Thus, cheap cloud storage is used for storing the data, while compute engines are used for running analytics on this data in “on-demand” mode. However, to perform any computation on the data in this architecture, the data should be moved from the storage layer to the compute layer over the network for each calculation. Obviously, that hurts calculation performance and requires huge network bandwidth. In this paper, we study different approaches to improve query performance in a data lake architecture. We define an optimization problem that can provably speed up data lake queries. We prove that the problem is NP-hard and suggest heuristic approaches. Then, we demonstrate through the experiments that our approach is feasible and efficient (up to ×30 query execution time improvement based on the TPC-H benchmark).\",\"PeriodicalId\":13202,\"journal\":{\"name\":\"IEEE Transactions on Cloud Computing\",\"volume\":\"12 1\",\"pages\":\"84-99\"},\"PeriodicalIF\":5.0000,\"publicationDate\":\"2023-12-05\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Cloud Computing\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10342737/\",\"RegionNum\":2,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, INFORMATION SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Cloud Computing","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10342737/","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}

引用次数: 0

摘要

云数据湖是存储海量数据的一种廉价解决方案。其主要理念是将计算层和存储层分离。因此，廉价的云存储用于存储数据，而计算引擎则用于以 "按需 "模式在这些数据上运行分析。然而，在这种架构中，要对数据进行任何计算，每次计算都要通过网络将数据从存储层移动到计算层。这显然会损害计算性能，而且需要巨大的网络带宽。本文研究了在数据湖架构中提高查询性能的不同方法。我们定义了一个优化问题，可以证明它能加快数据湖查询的速度。我们证明了该问题的 NP 难度，并提出了启发式方法。然后，我们通过实验证明了我们的方法是可行且高效的（根据 TPC-H 基准，查询执行时间最多可提高 ×30）。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Optimizing Cloud Data Lake Queries With a Balanced Coverage Plan

Cloud data lakes emerge as an inexpensive solution for storing very large amounts of data. The main idea is the separation of compute and storage layers. Thus, cheap cloud storage is used for storing the data, while compute engines are used for running analytics on this data in “on-demand” mode. However, to perform any computation on the data in this architecture, the data should be moved from the storage layer to the compute layer over the network for each calculation. Obviously, that hurts calculation performance and requires huge network bandwidth. In this paper, we study different approaches to improve query performance in a data lake architecture. We define an optimization problem that can provably speed up data lake queries. We prove that the problem is NP-hard and suggest heuristic approaches. Then, we demonstrate through the experiments that our approach is feasible and efficient (up to ×30 query execution time improvement based on the TPC-H benchmark).

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

IEEE Transactions on Cloud Computing Computer Science-Software

CiteScore

9.40

自引率

6.20%

发文量

167

期刊介绍： The IEEE Transactions on Cloud Computing (TCC) is dedicated to the multidisciplinary field of cloud computing. It is committed to the publication of articles that present innovative research ideas, application results, and case studies in cloud computing, focusing on key technical issues related to theory, algorithms, systems, applications, and performance.