DeepClean: Data Cleaning via Question Asking

2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA)
Pub Date: 2018-10-01
DOI: 10.1109/DSAA.2018.00039

Xinyang Zhang, Yujie Ji, Chanh Nguyen, Ting Wang

Citations: 1

Abstract

As a critical task in the data analysis pipeline, data cleaning is notoriously labor-intensive and error-prone. Knowledge base-assisted data cleaning has proved a powerful tool for finding and fixing data defects; however, its applicability is inevitably bounded by the natural limitations of knowledge bases. Meanwhile, although a vast number of knowledge sources exist in the form of free-text corpora (e.g., Wikipedia), transforming them into formats usable by existing data cleaning tools can be prohibitively costly and error-prone, if not outright impossible. Here, we present DeepClean, the first end-to-end data cleaning framework powered by free-text knowledge sources. At a high level, DeepClean leverages a knowledge source through its question-answering (QA) interface and achieves high-quality cleaning via iterative question asking. Specifically, DeepClean detects and repairs data defects in three stages: (i) Pattern extraction: it automatically discovers the semantic types of the data attributes as well as their correlations; (ii) Question generation: it translates each data tuple into a minimal set of validation questions; (iii) Completion and repair: by checking the answers returned by the knowledge source against the data values, it identifies erroneous cases and suggests possible fixes. Through extensive empirical studies, we demonstrate that DeepClean is applicable to a range of domains and can effectively repair a variety of data defects, highlighting data cleaning powered by free-text knowledge sources as a promising direction for future research.
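To make the question-asking loop concrete, here is a minimal Python sketch of stages (ii) and (iii). It is not the authors' implementation: the abstract gives no algorithmic detail, so every name here (`Pattern`, `generate_questions`, `clean_row`, `toy_ask`) is hypothetical, the attribute correlations that stage (i) would discover are written by hand, and a small dictionary stands in for a real QA interface over free text.

```python
from typing import Callable, Dict, List, Optional, Tuple

Row = Dict[str, str]
# Hypothetical pattern: (key attribute, dependent attribute, question template).
# In the paper these correlations come from stage (i); here they are hand-written.
Pattern = Tuple[str, str, str]


def generate_questions(row: Row, patterns: List[Pattern]) -> List[Tuple[str, str]]:
    """Stage (ii): turn one tuple into (target attribute, question) pairs."""
    return [
        (target, template.format(row[key]))
        for key, target, template in patterns
        if row.get(key)  # skip patterns whose key attribute is missing
    ]


def clean_row(
    row: Row,
    patterns: List[Pattern],
    ask: Callable[[str], Optional[str]],
) -> Tuple[Row, List[str]]:
    """Stage (iii): compare answers with stored values; fill blanks, flag mismatches."""
    repaired = dict(row)
    issues: List[str] = []
    for target, question in generate_questions(row, patterns):
        answer = ask(question)
        if answer is None:
            continue  # the knowledge source has no answer; leave the value alone
        current = row.get(target, "")
        if not current:
            repaired[target] = answer
            issues.append(f"{target}: filled with {answer!r}")
        elif current.strip().lower() != answer.strip().lower():
            repaired[target] = answer
            issues.append(f"{target}: {current!r} -> {answer!r}")
    return repaired, issues


# Toy stand-in for a QA interface (an assumption for illustration only).
TOY_ANSWERS = {
    "What is the capital of France?": "Paris",
    "What is the capital of Japan?": "Tokyo",
}


def toy_ask(question: str) -> Optional[str]:
    return TOY_ANSWERS.get(question)


patterns = [("country", "capital", "What is the capital of {}?")]
rows = [
    {"country": "France", "capital": "Lyon"},   # erroneous value
    {"country": "Japan", "capital": ""},        # missing value
    {"country": "Italy", "capital": "Rome"},    # no answer available
]
results = [clean_row(r, patterns, toy_ask) for r in rows]
for repaired, issues in results:
    print(repaired, issues)
```

In this sketch a suggested fix simply overwrites the stored value; a real system would more plausibly rank candidate answers and leave the final decision to a user or a confidence threshold.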
