Towards Transparent Data Cleaning: The Data Cleaning Model Explorer (DCM/X)
Nikolaus Nova Parulian, Bertram Ludäscher
2021 ACM/IEEE Joint Conference on Digital Libraries (JCDL), published 2021-09-01
DOI: 10.1109/JCDL52503.2021.00054 (https://doi.org/10.1109/JCDL52503.2021.00054)
Citations: 2
Abstract
To make data cleaning processes more transparent, we have developed DCM, a data cleaning model that can represent different kinds of provenance information from tools such as OpenRefine. The information in DCM captures the data cleaning history D0 ↝ Dn, i.e., how an input dataset D0 was transformed, through a number of data cleaning transformations, into a “clean” dataset Dn. Here we demonstrate a Python-based toolkit for OpenRefine that allows users to (i) harvest provenance information from previously executed data cleaning recipes and internal project files, (ii) load this information into a DCM database, and then (iii) explore the data lineage and processing history of Dn using provenance queries and visualizations. The provenance information contained in DCM, and in the views and query results over DCM, turns otherwise opaque data cleaning processes into transparent data cleaning workflows suitable for archival, sharing, and reuse.
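The harvest, load, and explore steps described above can be illustrated with a minimal sketch. The abstract does not give the actual DCM schema or the toolkit's API, so everything here is an assumption made for illustration: the two tables (`step`, `cell_change`), the column names, the sample data, and the helper functions `build_demo_db`, `cell_history`, and `changes_per_step` are all hypothetical. The sketch uses Python's standard `sqlite3` module to store a toy cleaning history D0 ↝ D3 and answer two simple provenance queries.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a toy provenance schema.

    Hypothetical schema, not the real DCM: one row per cleaning step and
    one row per cell value changed by that step.
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE step (
            step_id   INTEGER PRIMARY KEY,
            operation TEXT NOT NULL
        );
        CREATE TABLE cell_change (
            step_id   INTEGER NOT NULL REFERENCES step(step_id),
            row_id    INTEGER NOT NULL,
            col_name  TEXT NOT NULL,
            old_value TEXT,
            new_value TEXT
        );
        """
    )
    # "Harvested" history: made-up sample data standing in for a real recipe.
    conn.executemany(
        "INSERT INTO step VALUES (?, ?)",
        [(1, "trim whitespace"), (2, "to titlecase"), (3, "cluster and merge")],
    )
    conn.executemany(
        "INSERT INTO cell_change VALUES (?, ?, ?, ?, ?)",
        [
            (1, 0, "city", " chicago ", "chicago"),
            (2, 0, "city", "chicago", "Chicago"),
            (2, 1, "city", "urbana", "Urbana"),
            (3, 1, "city", "Urbana", "Urbana-Champaign"),
        ],
    )
    conn.commit()
    return conn


def cell_history(conn, row_id, col_name):
    """Return the ordered list of changes applied to one cell (its lineage)."""
    cursor = conn.execute(
        """
        SELECT s.step_id, s.operation, c.old_value, c.new_value
        FROM cell_change AS c
        JOIN step AS s ON s.step_id = c.step_id
        WHERE c.row_id = ? AND c.col_name = ?
        ORDER BY s.step_id
        """,
        (row_id, col_name),
    )
    return cursor.fetchall()


def changes_per_step(conn):
    """Return (step_id, number of changed cells) for every step, in order."""
    cursor = conn.execute(
        """
        SELECT s.step_id, COUNT(c.row_id)
        FROM step AS s
        LEFT JOIN cell_change AS c ON c.step_id = s.step_id
        GROUP BY s.step_id
        ORDER BY s.step_id
        """
    )
    return cursor.fetchall()


if __name__ == "__main__":
    db = build_demo_db()
    for step_id, operation, old, new in cell_history(db, 0, "city"):
        print(f"step {step_id} ({operation}): {old!r} -> {new!r}")
    print(changes_per_step(db))
```

Storing one row per changed cell, rather than one snapshot per dataset version, keeps lineage queries to a single join. This is only one possible design; the model described in the paper may be organized differently.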