Constance: An Intelligent Data Lake System

Proceedings of the 2016 International Conference on Management of Data. Pub Date: 2016-06-26. DOI: 10.1145/2882903.2899389

Rihan Hai, Sandra Geisler, C. Quix


Citations: 199

Abstract

Big Data, as the challenge of our time, still poses many open research problems, especially regarding the variety of data. The high diversity of data sources often results in information silos: collections of non-integrated data management systems with heterogeneous schemas, query languages, and APIs. Data Lake systems have been proposed as a solution to this problem, providing a schema-less repository for raw data with a common access interface. However, just dumping all data into a data lake without any metadata management would only lead to a 'data swamp'. To avoid this, we propose Constance, a Data Lake system with sophisticated metadata management over raw data extracted from heterogeneous data sources. Constance discovers, extracts, and summarizes the structural metadata from the data sources, and annotates data and metadata with semantic information to avoid ambiguities. With embedded query rewriting engines supporting structured and semi-structured data, Constance provides users with a unified interface for query processing and data exploration. During the demo, we will walk through each functional component of Constance. Constance will be applied to two real-life use cases to show attendees the importance and usefulness of our generic and extensible data lake system.
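To make the idea of extracting and summarizing structural metadata from semi-structured sources more concrete, here is a minimal Python sketch. It is not Constance's implementation, which the abstract does not describe; the function name summarize_structure, the path notation, and the sample records are all assumptions made for illustration. It walks raw JSON-like records and records, for each attribute path, the value types observed.

```python
from collections import defaultdict
from typing import Any


def summarize_structure(records: list[dict[str, Any]]) -> dict[str, set[str]]:
    """Map each attribute path to the set of value type names observed.

    Hypothetical helper for illustration only; not part of Constance.
    """
    summary: dict[str, set[str]] = defaultdict(set)

    def walk(value: Any, path: str) -> None:
        if isinstance(value, dict):
            # Nested object: extend the path with each key.
            for key, child in value.items():
                walk(child, f"{path}.{key}" if path else key)
        elif isinstance(value, list):
            # Array: mark the path with "[]" and descend into each element.
            for item in value:
                walk(item, f"{path}[]")
        else:
            # Leaf value: record its Python type name.
            summary[path].add(type(value).__name__)

    for record in records:
        walk(record, "")
    return dict(summary)


# Two made-up records with differing structure, as might arrive from
# heterogeneous sources.
records = [
    {"id": 1, "name": "sensor-a", "readings": [{"t": 0.5}]},
    {"id": "2", "name": "sensor-b", "location": {"lat": 50.78}},
]
summary = summarize_structure(records)
```

In this sketch, a path that maps to more than one type (here "id", seen as both an integer and a string) is the kind of ambiguity that metadata management would need to surface before the records can be queried through a single interface.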

