Matteo Interlandi, Julien Lacroix, Omar Boucelma, F. Guerra
Cleaning MapReduce Workflows
2017 International Conference on High Performance Computing & Simulation (HPCS), published 2017-07-01
DOI: 10.1109/HPCS.2017.22 (https://doi.org/10.1109/HPCS.2017.22)
Abstract
Integrity constraints (ICs) such as Functional Dependencies (FDs) or Inclusion Dependencies (INDs) are commonly used in databases to check whether input relations satisfy certain predefined quality metrics. Data-Intensive Scalable Computing (DISC) platforms such as MapReduce commonly accept (semi-structured) input data that is not in relational format; even so, the data is often transformed into key/value pairs when it has to be re-partitioned, a process commonly referred to as shuffle. In this work, we present a provenance-aware model for assessing the quality of shuffled data: more precisely, we capture and model provenance using the PROV-DM W3C recommendation, and we extend it with rules expressed à la Datalog to assess data quality dimensions by means of IC metrics over DISC systems. In this way, data (and algorithmic) errors can be detected promptly and automatically, without a lengthy process of output debugging.
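To make the idea of checking ICs over shuffled data concrete, the following is a minimal Python sketch. It is not the model described in the abstract: it uses neither PROV-DM nor Datalog, and the function names (`shuffle`, `fd_violations`, `ind_violations`) and the sample records are hypothetical, introduced only for illustration. It assumes the FD of interest is key -> value and the IND requires every shuffled key to appear in a reference key set.

```python
from collections import defaultdict


def shuffle(pairs):
    """Group key/value pairs by key, mimicking a MapReduce shuffle (illustrative only)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def fd_violations(groups):
    """Return keys that map to more than one distinct value,
    i.e. keys violating the functional dependency key -> value."""
    return {
        key: sorted(set(values))
        for key, values in groups.items()
        if len(set(values)) > 1
    }


def ind_violations(groups, reference_keys):
    """Return shuffled keys missing from reference_keys,
    i.e. keys violating the inclusion dependency."""
    return sorted(key for key in groups if key not in reference_keys)


# Hypothetical sample data: zip code -> city
pairs = [
    ("10001", "New York"),
    ("10001", "New York"),
    ("94105", "San Francisco"),
    ("94105", "Oakland"),
]
groups = shuffle(pairs)

print(fd_violations(groups))             # {'94105': ['Oakland', 'San Francisco']}
print(ind_violations(groups, {"10001"}))  # ['94105']
```

Because the check runs on the grouped pairs right after the shuffle, a violating key can be reported before any reduce-side output is produced, which is the kind of early detection the abstract aims for.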