Towards Transparent Data Cleaning: The Data Cleaning Model Explorer (DCM/X)

2021 ACM/IEEE Joint Conference on Digital Libraries (JCDL). Pub Date: 2021-09-01. DOI: 10.1109/JCDL52503.2021.00054

Nikolaus Nova Parulian, Bertram Ludäscher

Citations: 2

Abstract

To make data cleaning processes more transparent, we have developed DCM, a data cleaning model that can represent different kinds of provenance information from tools such as OpenRefine. The information in DCM captures the data cleaning history D0 ↝ Dn, i.e., how an input dataset D0 was transformed, through a number of data cleaning transformations, into a “clean” dataset Dn. Here we demonstrate a Python-based toolkit for OpenRefine that allows users to (i) harvest provenance information from previously executed data cleaning recipes and internal project files, (ii) load this information into a DCM database, and then (iii) explore the data lineage and processing history of Dn using provenance queries and visualizations. The provenance information contained in DCM, and in the views and query results over DCM, turns otherwise opaque data cleaning processes into transparent data cleaning workflows suitable for archival, sharing, and reuse.
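The three steps named in the abstract (harvest, load, explore) can be illustrated with a minimal sketch. It is not the DCM/X toolkit and does not reproduce the DCM schema: the `step` table, the functions `load_recipe` and `column_history`, and the sample column names are all hypothetical. The only thing assumed from OpenRefine is the JSON shape of an exported operation history (fields such as `op`, `columnName`, `oldColumnName`, `newColumnName`, `description`).

```python
import json
import sqlite3

# A small recipe in the JSON shape OpenRefine uses when exporting an
# operation history. The columns and expressions are made up for illustration.
RECIPE = json.loads("""
[
  {"op": "core/text-transform", "columnName": "name",
   "expression": "value.trim()",
   "description": "Text transform on cells in column name"},
  {"op": "core/column-rename", "oldColumnName": "name",
   "newColumnName": "author",
   "description": "Rename column name to author"},
  {"op": "core/mass-edit", "columnName": "author",
   "description": "Mass edit cells in column author"},
  {"op": "core/text-transform", "columnName": "year",
   "expression": "value.toNumber()",
   "description": "Text transform on cells in column year"}
]
""")


def load_recipe(conn, recipe):
    """Steps (i) and (ii): harvest recipe steps into a hypothetical table."""
    conn.execute(
        "CREATE TABLE step (step_id INTEGER PRIMARY KEY, op TEXT, "
        "col_in TEXT, col_out TEXT, description TEXT)"
    )
    for i, op in enumerate(recipe, start=1):
        if op["op"] == "core/column-rename":
            # A rename reads one column name and produces another.
            col_in, col_out = op["oldColumnName"], op["newColumnName"]
        else:
            # Assumption: every other step reads and writes the same column.
            col_in = col_out = op.get("columnName")
        conn.execute(
            "INSERT INTO step VALUES (?, ?, ?, ?, ?)",
            (i, op["op"], col_in, col_out, op.get("description", "")),
        )


def column_history(conn, column):
    """Step (iii): walk backwards from the final dataset, following renames."""
    history = []
    current = column
    rows = conn.execute(
        "SELECT step_id, op, col_in, col_out FROM step ORDER BY step_id DESC"
    ).fetchall()
    for step_id, op, col_in, col_out in rows:
        if col_out == current:
            history.append((step_id, op))
            current = col_in  # continue under the column's earlier name
    return list(reversed(history))


conn = sqlite3.connect(":memory:")
load_recipe(conn, RECIPE)
author_history = column_history(conn, "author")
print(author_history)
```

Walking the steps in reverse lets the query trace the final `author` column back through its earlier name `name`, a simple form of the lineage question "which transformations produced this column of Dn?". A fuller model would also have to handle steps that split, merge, or remove columns, which this sketch ignores.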
