Data Validation Utilizing Expert Knowledge and Shape Constraints

Journal of Data and Information Quality. Pub Date: 2024-05-11. DOI: 10.1145/3661826

F. Bachinger, Lisa Ehrlinger, G. Kronberger, Wolfram Wöß


Citations: 0

Abstract

Data validation is a primary concern in any data-driven application, as undetected data errors may negatively affect machine learning models and lead to suboptimal decisions. Data quality issues are usually detected manually by experts, which becomes infeasible and uneconomical for large volumes of data. To enable automated data validation, we propose “shape constraint-based data validation”, a novel approach based on machine learning that incorporates expert knowledge in the form of shape constraints. Shape constraints can be used to describe expected (multivariate and nonlinear) patterns in valid data, and enable the detection of invalid data which deviates from these expected patterns. Our approach can be divided into two steps: (1) shape-constrained prediction models are trained on data, and (2) their training error is analyzed to identify invalid data. The training error can be used as an indicator for invalid data because shape-constrained models can fit valid data better than invalid data. We evaluate the approach on a benchmark suite consisting of synthetic datasets, which we have published for benchmarking similar data validation approaches. Additionally, we demonstrate the capabilities of the proposed approach with a real-world dataset consisting of measurements from a friction test bench in an industrial setting. Our approach detects subtle data errors that are difficult to identify even for domain experts.
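
The two steps described in the abstract can be illustrated with a minimal sketch. It rests on several assumptions that are not taken from the paper: the shape constraint is "the signal is non-decreasing", the shape-constrained model is a plain isotonic regression (pool-adjacent-violators) standing in for whatever models the authors actually use, and the names `fit_monotone_increasing`, `training_rmse`, `is_invalid`, and the `threshold` value are hypothetical.

```python
from math import sqrt


def fit_monotone_increasing(y):
    """Step 1 (stand-in model): least-squares fit of a non-decreasing
    sequence to y, using the pool-adjacent-violators algorithm."""
    blocks = []  # each block is [sum of values, number of values]
    for value in y:
        blocks.append([value, 1])
        # Merge neighbouring blocks while their means violate the constraint.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            s, n = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += n
    fitted = []
    for s, n in blocks:
        fitted.extend([s / n] * n)  # every point in a block gets the block mean
    return fitted


def training_rmse(y):
    """Step 2: training error of the constrained model on its own data."""
    fitted = fit_monotone_increasing(y)
    return sqrt(sum((a - b) ** 2 for a, b in zip(y, fitted)) / len(y))


def is_invalid(y, threshold):
    """Flag a series whose training error exceeds a hypothetical threshold."""
    return training_rmse(y) > threshold


# Assumed example data: one series that follows the expected shape up to
# small noise, and one that clearly deviates from it.
valid_series = [0.0, 1.1, 0.9, 2.0, 3.0]
invalid_series = [0.0, 1.0, 5.0, 3.0, 4.0, 2.0]

print(is_invalid(valid_series, threshold=0.5))
print(is_invalid(invalid_series, threshold=0.5))
```

The constrained model can follow the first series almost exactly, so its training error stays small, while the second series cannot be fitted by any non-decreasing function and leaves a large residual. How to choose the threshold in practice is outside the scope of this sketch.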


