QDflows

Journal of Data and Information Quality (JDIQ), pp. 1-39. Pub Date: 2017-06-30. DOI: 10.1145/3064173

Sabrina Abdellaoui, Fahima Nader, R. Chalal



Abstract

In the big data era, data integration is becoming increasingly important. It is usually handled by data flow processes that extract, transform, and clean data from several sources and populate the data integration system (DIS). Designing data flows poses several challenges. In this article, we address data quality issues such as (1) specifying a set of quality rules, (2) enforcing them on the data flow pipeline to detect violations, and (3) producing accurate repairs for the detected violations. We propose QDflows, a system for designing quality-aware data flows that takes the following as input: (1) a high-quality knowledge base (KB) as the global schema of integration, (2) a set of data sources and a set of validated users’ requirements, (3) a set of defined mappings between data sources and the KB, and (4) a set of quality rules specified by users. QDflows uses an ontology to design the DIS schema. It allows the DIS ontology to be defined as a module of the knowledge base, based on validated users’ requirements. The DIS ontology model is then extended with multiple types of quality rules specified by users. QDflows extracts and transforms data from the sources to populate the DIS. It detects violations of the quality rules enforced on the data flows, constructs repair patterns, searches for horizontal and vertical matches in the knowledge base, and performs an automatic repair when possible or otherwise generates candidate repairs. It interactively involves users in validating the repair process before the clean data is loaded into the DIS. Using real-life and synthetic datasets together with the DBpedia and Yago knowledge bases, we experimentally evaluate the generality, effectiveness, and efficiency of QDflows. We also showcase an interactive tool implementing our system.
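The detect-then-repair loop described in the abstract can be illustrated with a minimal Python sketch. This is not the QDflows implementation: the toy knowledge base `TOY_KB`, the `Violation` class, and the functions `detect_violations` and `repair` are hypothetical names introduced here, and the simple dictionary lookup only stands in for the horizontal and vertical KB matching, which the abstract does not specify. The sketch assumes a single kind of quality rule (an attribute value must agree with the KB), repairs automatically when the KB offers exactly one candidate, and otherwise leaves the violation pending for user validation.

```python
from dataclasses import dataclass

# Hypothetical toy knowledge base: (subject, property) -> accepted values.
# A real system would query a KB such as DBpedia or Yago instead.
TOY_KB = {
    ("Paris", "country"): {"France"},
    ("Algiers", "country"): {"Algeria"},
    ("Tripoli", "country"): {"Libya", "Lebanon"},  # ambiguous subject
}


@dataclass
class Violation:
    index: int             # position of the offending record
    attribute: str         # attribute whose value breaks the rule
    value: str             # the offending value
    candidates: list       # KB values that would satisfy the rule


def detect_violations(records, kb, key_attr, target_attr):
    """Flag records whose target value disagrees with the KB."""
    violations = []
    for i, rec in enumerate(records):
        expected = kb.get((rec[key_attr], target_attr))
        if expected is None:
            continue  # the KB has no evidence, so the rule cannot be checked
        if rec[target_attr] not in expected:
            violations.append(
                Violation(i, target_attr, rec[target_attr], sorted(expected))
            )
    return violations


def repair(records, violations):
    """Apply unambiguous repairs; return the rest for user validation."""
    cleaned = [dict(r) for r in records]  # leave the input untouched
    pending = []
    for v in violations:
        if len(v.candidates) == 1:
            cleaned[v.index][v.attribute] = v.candidates[0]  # automatic repair
        else:
            pending.append(v)  # several possible repairs: ask the user
    return cleaned, pending


records = [
    {"city": "Paris", "country": "Germany"},
    {"city": "Tripoli", "country": "Italy"},
    {"city": "Algiers", "country": "Algeria"},
]
violations = detect_violations(records, TOY_KB, "city", "country")
cleaned, pending = repair(records, violations)
```

In this sketch the Paris record is repaired automatically because the toy KB yields a single candidate, while the Tripoli record stays unchanged and is returned in `pending` with both candidates, mirroring the abstract's distinction between automatic repairs and possible repairs that users validate before loading.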
