The Challenge of Building Effective, Enterprise-scale Data Lakes

Awez Syed
{"title":"The Challenge of Building Effective, Enterprise-scale Data Lakes","authors":"Awez Syed","doi":"10.1145/3318464.3393816","DOIUrl":null,"url":null,"abstract":"There has been a rapid rise in the popularity of data lakes as the data infrastructure for modern analytics and data science. The combination of cloud storage and fast, elastic processing provides an inexpensive and scalable solution for building analytical applications. While data lakes make it easy to ingest and store vast amounts of data, the ability to effectively make use of that data is still limited. This data often lacks context, doesn't meet the quality required for applications, and is not easily understandable or discoverable by users. Problems of data consistency and accuracy make it hard to derive value from data lakes and to trust the analytics based on this data. The traditional methods of manually documenting, classifying and assessing the data don't scale to the volume of cloud-based data lakes. New automated, learning-based approaches are required to discover, curate and make the data usable for a wide variety of users. In this talk, we describe the real-world implementation patterns of data lakes and give an overview of the many open challenges in deploying successful, enterprise-scale data lakes.","PeriodicalId":436122,"journal":{"name":"Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data","volume":"98 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-05-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3318464.3393816","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 2

Abstract

There has been a rapid rise in the popularity of data lakes as the data infrastructure for modern analytics and data science. The combination of cloud storage and fast, elastic processing provides an inexpensive and scalable solution for building analytical applications. While data lakes make it easy to ingest and store vast amounts of data, the ability to effectively make use of that data is still limited. This data often lacks context, doesn't meet the quality required for applications, and is not easily understandable or discoverable by users. Problems of data consistency and accuracy make it hard to derive value from data lakes and to trust the analytics based on this data. The traditional methods of manually documenting, classifying and assessing the data don't scale to the volume of cloud-based data lakes. New automated, learning-based approaches are required to discover, curate and make the data usable for a wide variety of users. In this talk, we describe the real-world implementation patterns of data lakes and give an overview of the many open challenges in deploying successful, enterprise-scale data lakes.
构建有效的企业级数据湖的挑战
作为现代分析和数据科学的数据基础设施,数据湖的普及程度迅速上升。云存储和快速弹性处理的结合为构建分析应用程序提供了一种廉价且可扩展的解决方案。虽然数据湖使摄取和存储大量数据变得容易,但有效利用这些数据的能力仍然有限。这些数据通常缺乏上下文,不符合应用程序所需的质量,并且不容易被用户理解或发现。数据一致性和准确性的问题使得很难从数据湖中获得价值,也很难信任基于这些数据的分析。手工记录、分类和评估数据的传统方法无法适应基于云的数据湖的规模。需要新的自动化的、基于学习的方法来发现、整理和使数据可供各种各样的用户使用。在这次演讲中,我们描述了数据湖的实际实现模式,并概述了部署成功的企业级数据湖的许多开放挑战。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信