Cleaning MapReduce Workflows

2017 International Conference on High Performance Computing & Simulation (HPCS). Pub Date: 2017-07-01. DOI: 10.1109/HPCS.2017.22

Matteo Interlandi, Julien Lacroix, Omar Boucelma, F. Guerra

Abstract

Integrity constraints (ICs) such as Functional Dependencies (FDs) or Inclusion Dependencies (INDs) are commonly used in databases to check whether input relations satisfy certain pre-defined quality metrics. Although Data-Intensive Scalable Computing (DISC) platforms such as MapReduce commonly accept (semi-structured) input data that is not in relational format, data is still often transformed into key/value pairs when it must be re-partitioned, a process commonly referred to as shuffle. In this work, we present a provenance-aware model for assessing the quality of shuffled data. More precisely, we capture and model provenance using the PROV-DM W3C recommendation, and we extend it with Datalog-style rules to assess data quality dimensions by means of IC metrics over DISC systems. In this way, data (and algorithmic) errors can be detected promptly and automatically, without a lengthy process of output debugging.
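
To make the idea of checking an IC over shuffled data concrete, the following is a minimal sketch, not the provenance model or rule language of the paper. It assumes shuffled records are available as plain (key, value) tuples and tests the functional dependency key -> value; the function name `fd_violations` and the sample records are hypothetical.

```python
from collections import defaultdict


def fd_violations(pairs):
    """Return the keys that map to more than one distinct value.

    `pairs` is an iterable of (key, value) tuples standing in for
    shuffled records (an assumption of this sketch). The functional
    dependency key -> value holds exactly when the result is empty.
    """
    seen = defaultdict(set)
    for key, value in pairs:
        seen[key].add(value)
    # Keep only the keys associated with conflicting values.
    return {key: values for key, values in seen.items() if len(values) > 1}


# Hypothetical shuffled records: one key appears with two different values.
shuffled = [
    ("zip:10001", "New York"),
    ("zip:94105", "San Francisco"),
    ("zip:10001", "Newark"),
]
violations = fd_violations(shuffled)
```

A non-empty result flags the offending keys right after the shuffle, which is the kind of early detection the abstract contrasts with debugging the final output.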
