CM-Explorer: Dissecting Data Ingestion Problems

IF 3.3 · Zone 3, Computer Science · Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS

Proceedings of the VLDB Endowment · Pub Date: 2023-08-01 · DOI: 10.14778/3611540.3611595

Niels Bylois, Frank Neven, Stijn Vansummeren


Citations: 0

Abstract

Data ingestion validation, the task of certifying the quality of continuously collected data, is crucial for ensuring the trustworthiness of analytics insights. A widely used approach to validating data quality is to specify, either manually or automatically, so-called data unit tests that check whether data quality metrics lie within expected bounds. We employ conditional unit tests based on conditional metrics (CMs), which compute data quality signals over specific parts of the ingestion data and therefore allow fine-grained detection of errors. A violated conditional unit test identifies a set of erroneous tuples in a natural way: the subrelation that its CM refers to. The downside of this fine-grained nature is that violated unit tests are often correlated: a single error in an ingestion batch may cause multiple tests, each referring to a different part of the batch, to fail. The key challenge is therefore to untangle this correlation and single out the most relevant violated conditional unit tests, i.e., tests that identify a core set of erroneous tuples and act as an explanation for the errors. We present CM-Explorer, a system that helps data stewards quickly find the most relevant violated conditional unit tests. The system consists of three components: (1) a graph explorer for visualizing the correlation structure of the violated unit tests; (2) a relation explorer for browsing the tuples selected by conditional unit tests; and (3) a history explorer for gaining insight into why conditional unit tests are violated. In this paper, we discuss these components and present the scenarios that we make available for the demonstration.
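To make the notion of a conditional unit test concrete, the following is a minimal sketch and not CM-Explorer's implementation. It assumes that a conditional metric is a plain metric (here, column completeness) evaluated only on the tuples that satisfy a condition, and that the test fails when the value leaves a fixed interval. The class `ConditionalUnitTest`, the helper `completeness`, the sample batch, and the bounds are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Set

Row = Dict[str, Optional[str]]


@dataclass
class ConditionalUnitTest:
    """Hypothetical conditional unit test: a metric over a subrelation plus bounds."""
    name: str
    condition: Callable[[Row], bool]      # selects the subrelation the CM refers to
    metric: Callable[[List[Row]], float]  # data quality metric on that subrelation
    lower: float
    upper: float

    def selected_ids(self, batch: List[Row]) -> Set[int]:
        # Positions of the tuples in the batch that satisfy the condition.
        return {i for i, row in enumerate(batch) if self.condition(row)}

    def is_violated(self, batch: List[Row]) -> bool:
        rows = [batch[i] for i in sorted(self.selected_ids(batch))]
        if not rows:  # assumption: an empty subrelation never violates a test
            return False
        return not (self.lower <= self.metric(rows) <= self.upper)


def completeness(column: str) -> Callable[[List[Row]], float]:
    """Fraction of rows in which `column` is not missing."""
    def metric(rows: List[Row]) -> float:
        return sum(row[column] is not None for row in rows) / len(rows)
    return metric


# One underlying error: zip codes are missing for Belgian mobile records.
batch: List[Row] = [
    {"country": "BE", "device": "mobile", "zip": None},
    {"country": "BE", "device": "mobile", "zip": None},
    {"country": "BE", "device": "desktop", "zip": "3500"},
    {"country": "NL", "device": "mobile", "zip": "1011"},
    {"country": "NL", "device": "desktop", "zip": "1012"},
]

tests = [
    ConditionalUnitTest("zip complete | country=BE",
                        lambda r: r["country"] == "BE", completeness("zip"), 0.9, 1.0),
    ConditionalUnitTest("zip complete | device=mobile",
                        lambda r: r["device"] == "mobile", completeness("zip"), 0.9, 1.0),
    ConditionalUnitTest("zip complete | country=NL",
                        lambda r: r["country"] == "NL", completeness("zip"), 0.9, 1.0),
]

violated = [t for t in tests if t.is_violated(batch)]
violated_names = [t.name for t in violated]

# Tuples shared by the two violated tests hint at a common root cause.
shared = violated[0].selected_ids(batch) & violated[1].selected_ids(batch)
```

In this toy batch a single error makes two tests fail, and the tuples they share (positions 0 and 1) are exactly the erroneous ones. The set intersection is only an illustration of why violated tests are correlated; how CM-Explorer itself ranks or relates violated tests is not specified in the abstract.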


Source journal

Proceedings of the VLDB Endowment (Computer Science: General Computer Science)

CiteScore: 7.70

Self-citation rate: 0.00%


Journal description: The Proceedings of the VLDB Endowment (PVLDB) welcomes original research papers on a broad range of topics related to all aspects of data management where systems issues play a significant role, such as data management system technology and information management infrastructures, including their very large scale of experimentation, novel architectures, and demanding applications, as well as their underpinning theory. The scope of a submission for PVLDB is also described by the subject areas given below. Moreover, the scope of PVLDB is restricted to scientific areas that are covered by the combined expertise of the journal’s editorial board on the submission’s topic. Finally, the submission’s contributions should build on work already published in data management outlets, e.g., PVLDB, VLDBJ, ACM SIGMOD, IEEE ICDE, EDBT, ACM TODS, IEEE TKDE, and go beyond a syntactic citation.