Data Ingestion Validation Through Stable Conditional Metrics with Ranking and Filtering

IF 6.9 · CAS Zone 3 (Management) · JCR Q1 (COMPUTER SCIENCE, INFORMATION SYSTEMS)
Niels Bylois, Frank Neven, Stijn Vansummeren
Information Systems Frontiers. Published 2024-07-05. DOI: 10.1007/s10796-024-10504-y (https://doi.org/10.1007/s10796-024-10504-y)

Abstract

We introduce an advanced method for validating data quality, which is crucial for ensuring reliable analytics insights. Traditional data quality validation relies on data unit tests, which use global metrics to determine if data quality falls within expected ranges. Unfortunately, these existing approaches suffer from two limitations. Firstly, they offer only coarse-grained assessments, missing fine-grained errors. Secondly, they fail to pinpoint the specific data causing test failures. To address these issues, we propose a novel approach using conditional metrics, enabling more detailed analysis than global metrics. Our method involves two stages: unit test discovery and monitoring/error identification. In the discovery phase, we derive conditional metric-based unit tests from historical data, focusing on stability to select appropriate metrics. The monitoring phase involves using these tests for new data batches, with conditional metrics helping us identify potential errors. We validate the effectiveness of this approach using two datasets and seven synthetic error scenarios, showing significant improvements over global metrics and promising results in fine-grained error detection for data ingestion validation.
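
The contrast between global and conditional metrics described in the abstract can be illustrated with a small sketch. This is not the authors' algorithm: the metric (completeness), the stability criterion (a cap on the standard deviation across historical batches), the acceptance range, and all names such as `discover_test`, `max_std`, `k`, and `slack` are assumptions chosen for illustration only.

```python
from statistics import mean, pstdev


def completeness(rows, column, condition=None):
    """Fraction of non-missing values in `column`, optionally restricted
    to the rows for which `condition(row)` is true (a conditional metric)."""
    selected = [r for r in rows if condition is None or condition(r)]
    if not selected:
        return None
    return sum(r[column] is not None for r in selected) / len(selected)


def discover_test(history, column, condition=None, max_std=0.05, k=3.0, slack=0.01):
    """Discovery phase (assumed form): keep the metric only if it was stable
    over the historical batches, and derive an acceptance range from them."""
    values = [completeness(batch, column, condition) for batch in history]
    values = [v for v in values if v is not None]
    if len(values) < 2 or pstdev(values) > max_std:
        return None  # metric is too unstable to serve as a unit test
    mu, sigma = mean(values), pstdev(values)
    return (mu - k * sigma - slack, mu + k * sigma + slack)


def passes(batch, column, bounds, condition=None):
    """Monitoring phase: does the metric on the new batch fall in range?"""
    value = completeness(batch, column, condition)
    low, high = bounds
    return value is not None and low <= value <= high


def make_batch(nl_present, be_present):
    """Hypothetical batch: 90 'NL' rows and 10 'BE' rows with some prices missing."""
    nl = [{"country": "NL", "price": 1.0 if i < nl_present else None} for i in range(90)]
    be = [{"country": "BE", "price": 1.0 if i < be_present else None} for i in range(10)]
    return nl + be


def is_be(row):
    return row["country"] == "BE"


# Historical batches: global completeness fluctuates, BE completeness is always 1.0.
history = [make_batch(80, 10), make_batch(85, 10), make_batch(90, 10)]
global_bounds = discover_test(history, "price")
be_bounds = discover_test(history, "price", condition=is_be)

# New batch in which half of the BE prices are missing.
new_batch = make_batch(88, 5)
global_ok = passes(new_batch, "price", global_bounds)
be_ok = passes(new_batch, "price", be_bounds, condition=is_be)

# Error identification: a failing conditional test narrows the search to its rows.
suspects = [] if be_ok else [r for r in new_batch if is_be(r)]
print(global_ok, be_ok, len(suspects))
```

In this toy setup the global test accepts the new batch because the error is diluted across all rows, while the conditional test on the `BE` subset rejects it and points to the ten rows that satisfy its condition.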

Source journal
Information Systems Frontiers (Engineering & Technology: Computer Science, Theory & Methods)
CiteScore: 13.30
Self-citation rate: 18.60%
Articles published: 127
Review time: 9 months
About the journal: The interdisciplinary interfaces of Information Systems (IS) are fast emerging as defining areas of research and development in IS. These developments are largely due to the transformation of Information Technology (IT) towards networked worlds and its effects on global communications and economies. While these developments are shaping the way information is used in all forms of human enterprise, they are also setting the tone and pace of information systems of the future. The major advances in IT such as client/server systems, the Internet and the desktop/multimedia computing revolution, for example, have led to numerous important vistas of research and development with considerable practical impact and academic significance. While the industry seeks to develop high performance IS/IT solutions to a variety of contemporary information support needs, academia looks to extend the reach of IS technology into new application domains. Information Systems Frontiers (ISF) aims to provide a common forum of dissemination of frontline industrial developments of substantial academic value and pioneering academic research of significant practical impact.