CLAMS: Bringing Quality to Data Lakes
Mina H. Farid, Alexandra Roatis, I. Ilyas, H. Hoffmann, Xu Chu
Proceedings of the 2016 International Conference on Management of Data, 2016-06-26
DOI: 10.1145/2882903.2899391
Citations: 72
Abstract
With the increasing incentive for enterprises to ingest as much data as they can into what are commonly referred to as "data lakes", and with the recent development of multiple technologies to support this "load-first" paradigm, the new environment presents serious data management challenges. Among them, assessing data quality and cleaning large volumes of heterogeneous data become essential tasks in unveiling the value of big data. The desire to use unstructured and semi-structured data in large volumes means that current data cleaning tools, which are primarily designed for relational data, are not directly applicable. We present CLAMS, a system that discovers and enforces expressive integrity constraints over large amounts of lake data with very limited schema information (e.g., data represented as RDF triples). This demonstration shows how CLAMS simultaneously discovers constraints and the schemas on which they are defined. CLAMS also introduces a scale-out solution to efficiently detect errors in the raw data, and it interacts with human experts both to validate the discovered constraints and to suggest data repairs. CLAMS has been deployed in a real large-scale enterprise data lake and evaluated on a real data set of 1.2 billion triples. It has been able to spot multiple obscure data inconsistencies and errors early in the data processing stack, providing substantial value to the enterprise.
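The abstract does not describe CLAMS's constraint language or detection algorithm, but the basic idea of flagging triples that violate an integrity constraint can be illustrated with a minimal sketch. The Python snippet below is a hypothetical, simplified illustration only (not the CLAMS API): it checks a single "functional predicate" constraint over triples held in memory, whereas CLAMS discovers such constraints automatically and detects violations at scale over billions of triples.

```python
# Toy illustration of constraint-based error detection over RDF-style triples.
# This is NOT the CLAMS API; the constraint form (a functional predicate) and
# all identifiers below are hypothetical, chosen only to make the idea concrete.
from collections import defaultdict

# A triple is (subject, predicate, object), mirroring the RDF representation
# mentioned in the abstract.
triples = [
    ("emp:101", "worksIn", "dept:sales"),
    ("emp:101", "worksIn", "dept:hr"),      # conflicting value -> violation
    ("emp:102", "worksIn", "dept:sales"),
    ("emp:102", "hireDate", "2014-03-01"),
]

def functional_violations(triples, predicate):
    """Return subjects that have more than one object for a predicate that is
    expected to be single-valued (a simple 'functional' integrity constraint)."""
    values = defaultdict(set)
    for s, p, o in triples:
        if p == predicate:
            values[s].add(o)
    return {s: objs for s, objs in values.items() if len(objs) > 1}

if __name__ == "__main__":
    # Enforce the (assumed) rule "an employee works in exactly one department".
    for subject, objs in functional_violations(triples, "worksIn").items():
        print(f"violation: {subject} has multiple 'worksIn' values: {sorted(objs)}")
```

At the scale reported in the paper (1.2 billion triples), such a check would not run as an in-memory scan; the same logic would instead be expressed as a distributed grouping on (subject, predicate), which is the kind of scale-out detection the abstract alludes to.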