Automatic String Data Validation with Pattern Discovery

arXiv - CS - Databases Pub Date : 2024-08-06 DOI:arxiv-2408.03005

Xinwei Lin, Jing Zhao, Peng Di, Chuan Xiao, Rui Mao, Yan Ji, Makoto Onizuka, Zishuo Ding, Weiyi Shang, Jianbin Qin

Abstract

In enterprise data pipelines, data insertions occur periodically and may impact downstream services if data quality issues are not addressed. Typically, such problems are investigated and fixed by on-call engineers, but locating their cause and fixing the errors is often time-consuming. Automatic data validation is therefore a better way to protect the system and downstream services, because it enables early detection of errors and provides detailed error messages for quick resolution. This paper proposes a self-validating data management system that uses automatic pattern discovery techniques to verify the correctness of semi-structured string data in enterprise data pipelines. Our solution extracts patterns from historical data and detects erroneous incoming data in a top-down fashion. High-level information about the historical data is first analyzed to discover the format skeleton of correct values. Fine-grained semantic patterns are then extracted to strike a balance between generalization and specificity of the discovered pattern, thus covering as many correct values as possible while avoiding over-fitting. To tackle cold start and rapid data growth, we propose an incremental update strategy and an example generalization strategy. Experiments on large-scale industrial and public datasets demonstrate the effectiveness and efficiency of our method compared to alternative solutions. Furthermore, a case study on an industrial platform (Ant Group Inc.) with thousands of applications shows that our system captures meaningful data patterns in daily operations and helps engineers quickly identify errors.
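The abstract gives no algorithmic detail, so the following is only a rough sketch of what "discovering a format skeleton from historical data" could look like, not the authors' method. It assumes a skeleton is obtained by collapsing runs of digits and letters into placeholder tokens, and that a skeleton is accepted when it covers a minimum share of historical values. The names `skeleton`, `learn_skeletons`, `validate`, and the `min_support` parameter are all hypothetical.

```python
import re
from collections import Counter

# One token per run of ASCII digits, run of ASCII letters, or single other character.
_TOKEN = re.compile(r"[0-9]+|[A-Za-z]+|.", re.DOTALL)


def skeleton(value: str) -> str:
    """Map a string to a coarse format skeleton, e.g. '2024-08-06' -> '<D>-<D>-<D>'."""
    parts = []
    for match in _TOKEN.finditer(value):
        token = match.group()
        if token.isascii() and token.isdigit():
            parts.append("<D>")
        elif token.isascii() and token.isalpha():
            parts.append("<A>")
        else:
            parts.append(token)  # keep separators and other symbols literally
    return "".join(parts)


def learn_skeletons(history, min_support=0.5):
    """Keep skeletons that cover at least `min_support` of the historical values.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    counts = Counter(skeleton(v) for v in history)
    total = len(history)
    return {s for s, c in counts.items() if c / total >= min_support}


def validate(value, accepted):
    """Return (is_valid, message) so a failure carries a readable explanation."""
    found = skeleton(value)
    if found in accepted:
        return True, "ok"
    return False, f"skeleton {found!r} not in accepted set {sorted(accepted)}"


history = ["2024-08-06", "2023-01-15", "2022-12-31", "2021-07-04", "06/08/2024"]
accepted = learn_skeletons(history, min_support=0.5)
print(validate("2025-02-30", accepted))
print(validate("2025/02/30", accepted))
```

Note that `2025-02-30` passes this check even though it is not a real date: a skeleton alone captures only the coarse format. That gap is the kind of thing the fine-grained semantic patterns described in the abstract are meant to narrow, and this sketch does not attempt it.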
