The Challenge of Building Effective, Enterprise-scale Data Lakes

Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data. Pub Date: 2020-05-29. DOI: 10.1145/3318464.3393816

Awez Syed

Citations: 2

Abstract

There has been a rapid rise in the popularity of data lakes as the data infrastructure for modern analytics and data science. The combination of cloud storage and fast, elastic processing provides an inexpensive and scalable solution for building analytical applications. While data lakes make it easy to ingest and store vast amounts of data, the ability to effectively make use of that data is still limited. This data often lacks context, doesn't meet the quality required for applications, and is not easily understandable or discoverable by users. Problems of data consistency and accuracy make it hard to derive value from data lakes and to trust the analytics based on this data. The traditional methods of manually documenting, classifying and assessing the data don't scale to the volume of cloud-based data lakes. New automated, learning-based approaches are required to discover, curate and make the data usable for a wide variety of users. In this talk, we describe the real-world implementation patterns of data lakes and give an overview of the many open challenges in deploying successful, enterprise-scale data lakes.
