Data Lakehouse: A survey and experimental study

IF 3 2区计算机科学 Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS

Information Systems Pub Date : 2024-09-26 DOI:10.1016/j.is.2024.102460

Ahmed A. Harby , Farhana Zulkernine

{"title":"Data Lakehouse: A survey and experimental study","authors":"Ahmed A. Harby , Farhana Zulkernine","doi":"10.1016/j.is.2024.102460","DOIUrl":null,"url":null,"abstract":"<div><div>Efficient big data management is a dire necessity to manage the exponential growth in data generated by digital information systems to produce usable knowledge. Structured databases, data lakes, and warehouses have each provided a solution with varying degrees of success. However, a new and superior solution, the data Lakehouse, has emerged to extract actionable insights from unstructured data ingested from distributed sources. By combining the strengths of data warehouses and data lakes, the data Lakehouse can process and merge data quickly while ingesting and storing high-speed unstructured data with post-storage transformation and analytics capabilities. The Lakehouse architecture offers the necessary features for optimal functionality and has gained significant attention in the big data management research community. In this paper, we compare data lake, warehouse, and lakehouse systems, highlight their strengths and shortcomings, identify the desired features to handle the evolving challenges in big data management and analysis and propose an advanced data Lakehouse architecture. We also demonstrate the performance of three state-of-the-art data management systems namely HDFS data lake, Hive data warehouse, and Delta lakehouse in managing data for analytical query responses through an experimental study.</div></div>","PeriodicalId":50363,"journal":{"name":"Information Systems","volume":"127 ","pages":"Article 102460"},"PeriodicalIF":3.0000,"publicationDate":"2024-09-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Information Systems","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0306437924001182","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}

引用次数: 0

Abstract

Efficient big data management is a dire necessity to manage the exponential growth in data generated by digital information systems to produce usable knowledge. Structured databases, data lakes, and warehouses have each provided a solution with varying degrees of success. However, a new and superior solution, the data Lakehouse, has emerged to extract actionable insights from unstructured data ingested from distributed sources. By combining the strengths of data warehouses and data lakes, the data Lakehouse can process and merge data quickly while ingesting and storing high-speed unstructured data with post-storage transformation and analytics capabilities. The Lakehouse architecture offers the necessary features for optimal functionality and has gained significant attention in the big data management research community. In this paper, we compare data lake, warehouse, and lakehouse systems, highlight their strengths and shortcomings, identify the desired features to handle the evolving challenges in big data management and analysis and propose an advanced data Lakehouse architecture. We also demonstrate the performance of three state-of-the-art data management systems namely HDFS data lake, Hive data warehouse, and Delta lakehouse in managing data for analytical query responses through an experimental study.

Abstract Image

查看原文本刊更多论文

数据湖：调查与实验研究

高效的大数据管理是管理数字信息系统产生的指数级增长数据以产生可用知识的迫切需要。结构化数据库、数据湖和仓库都提供了不同程度的解决方案。然而，一种新的、更优越的解决方案--数据湖，已经出现，它可以从从分布式来源获取的非结构化数据中提取可操作的见解。通过结合数据仓库和数据湖的优势，数据湖可以快速处理和合并数据，同时利用存储后转换和分析功能摄取和存储高速非结构化数据。Lakehouse 架构提供了实现最佳功能的必要特性，在大数据管理研究界获得了极大关注。在本文中，我们比较了数据湖、仓库和 Lakehouse 系统，强调了它们的优势和不足，确定了应对大数据管理和分析中不断变化的挑战所需的功能，并提出了一种先进的数据 Lakehouse 架构。我们还通过一项实验研究，展示了三种最先进的数据管理系统（即 HDFS 数据湖、Hive 数据仓库和 Delta Lakehouse）在管理数据以进行分析查询响应方面的性能。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

Information Systems 工程技术-计算机：信息系统

CiteScore

9.40

自引率

2.70%

发文量

112

审稿时长

53 days

期刊介绍： Information systems are the software and hardware systems that support data-intensive applications. The journal Information Systems publishes articles concerning the design and implementation of languages, data models, process models, algorithms, software and hardware for information systems. Subject areas include data management issues as presented in the principal international database conferences (e.g., ACM SIGMOD/PODS, VLDB, ICDE and ICDT/EDBT) as well as data-related issues from the fields of data mining/machine learning, information retrieval coordinated with structured data, internet and cloud data management, business process management, web semantics, visual and audio information systems, scientific computing, and data science. Implementation papers having to do with massively parallel data management, fault tolerance in practice, and special purpose hardware for data-intensive systems are also welcome. Manuscripts from application domains, such as urban informatics, social and natural science, and Internet of Things, are also welcome. All papers should highlight innovative solutions to data management problems such as new data models, performance enhancements, and show how those innovations contribute to the goals of the application.