企业级数据湖管理的区域参考模型

Corinna Giebler, Christoph Gröger, Eva Hoos, H. Schwarz, B. Mitschang
{"title":"企业级数据湖管理的区域参考模型","authors":"Corinna Giebler, Christoph Gröger, Eva Hoos, H. Schwarz, B. Mitschang","doi":"10.1109/EDOC49727.2020.00017","DOIUrl":null,"url":null,"abstract":"Data lakes are on the rise as data platforms for any kind of analytics, from data exploration to machine learning. They achieve the required flexibility by storing heterogeneous data in their raw format, and by avoiding the need for pre-defined use cases. However, storing only raw data is inefficient, as for many applications, the same data processing has to be applied repeatedly. To foster the reuse of processing steps, literature proposes to store data in different degrees of processing in addition to their raw format. To this end, data lakes are typically structured in zones. There exists various zone models, but they are varied, vague, and no assessments are given. It is unclear which of these zone models is applicable in a practical data lake implementation in enterprises. In this work, we assess existing zone models using requirements derived from multiple representative data analytics use cases of a real-world industry case. We identify the shortcomings of existing work and develop a zone reference model for enterprise-grade data lake management in a detailed manner. We assess the reference model’s applicability through a prototypical implementation for a real-world enterprise data lake use case. This assessment shows that the zone reference model meets the requirements relevant in practice and is ready for industry use.","PeriodicalId":409420,"journal":{"name":"2020 IEEE 24th International Enterprise Distributed Object Computing Conference (EDOC)","volume":"16 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"10","resultStr":"{\"title\":\"A Zone Reference Model for Enterprise-Grade Data Lake Management\",\"authors\":\"Corinna Giebler, Christoph Gröger, Eva Hoos, H. Schwarz, B. Mitschang\",\"doi\":\"10.1109/EDOC49727.2020.00017\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Data lakes are on the rise as data platforms for any kind of analytics, from data exploration to machine learning. They achieve the required flexibility by storing heterogeneous data in their raw format, and by avoiding the need for pre-defined use cases. However, storing only raw data is inefficient, as for many applications, the same data processing has to be applied repeatedly. To foster the reuse of processing steps, literature proposes to store data in different degrees of processing in addition to their raw format. To this end, data lakes are typically structured in zones. There exists various zone models, but they are varied, vague, and no assessments are given. It is unclear which of these zone models is applicable in a practical data lake implementation in enterprises. In this work, we assess existing zone models using requirements derived from multiple representative data analytics use cases of a real-world industry case. We identify the shortcomings of existing work and develop a zone reference model for enterprise-grade data lake management in a detailed manner. We assess the reference model’s applicability through a prototypical implementation for a real-world enterprise data lake use case. This assessment shows that the zone reference model meets the requirements relevant in practice and is ready for industry use.\",\"PeriodicalId\":409420,\"journal\":{\"name\":\"2020 IEEE 24th International Enterprise Distributed Object Computing Conference (EDOC)\",\"volume\":\"16 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2020-10-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"10\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2020 IEEE 24th International Enterprise Distributed Object Computing Conference (EDOC)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/EDOC49727.2020.00017\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2020 IEEE 24th International Enterprise Distributed Object Computing Conference (EDOC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/EDOC49727.2020.00017","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 10

摘要

作为从数据探索到机器学习的各种分析的数据平台,数据湖正在崛起。它们通过以原始格式存储异构数据和避免需要预定义的用例来实现所需的灵活性。但是,只存储原始数据是低效的,因为对于许多应用程序,必须重复应用相同的数据处理。为了促进处理步骤的重用,文献建议在原始格式之外,以不同程度的处理方式存储数据。为此,数据湖通常按区域进行结构化。虽然存在着各种区域模型,但它们都是多种多样的、模糊的,而且没有给出评价。目前还不清楚哪些区域模型适用于企业的实际数据湖实施。在这项工作中,我们使用来自真实行业案例的多个代表性数据分析用例的需求来评估现有的区域模型。我们发现了现有工作的不足,并详细地开发了企业级数据湖管理的区域参考模型。我们通过一个真实企业数据湖用例的原型实现来评估参考模型的适用性。评估结果显示,该区域参考模型符合相关实践要求,可供工业使用。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
A Zone Reference Model for Enterprise-Grade Data Lake Management
Data lakes are on the rise as data platforms for any kind of analytics, from data exploration to machine learning. They achieve the required flexibility by storing heterogeneous data in their raw format, and by avoiding the need for pre-defined use cases. However, storing only raw data is inefficient, as for many applications, the same data processing has to be applied repeatedly. To foster the reuse of processing steps, literature proposes to store data in different degrees of processing in addition to their raw format. To this end, data lakes are typically structured in zones. There exists various zone models, but they are varied, vague, and no assessments are given. It is unclear which of these zone models is applicable in a practical data lake implementation in enterprises. In this work, we assess existing zone models using requirements derived from multiple representative data analytics use cases of a real-world industry case. We identify the shortcomings of existing work and develop a zone reference model for enterprise-grade data lake management in a detailed manner. We assess the reference model’s applicability through a prototypical implementation for a real-world enterprise data lake use case. This assessment shows that the zone reference model meets the requirements relevant in practice and is ready for industry use.
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信