Automatic String Data Validation with Pattern Discovery
Xinwei Lin, Jing Zhao, Peng Di, Chuan Xiao, Rui Mao, Yan Ji, Makoto Onizuka, Zishuo Ding, Weiyi Shang, Jianbin Qin
arXiv - CS - Databases, published 2024-08-06. https://doi.org/arxiv-2408.03005
Abstract
In enterprise data pipelines, data insertions occur periodically and may
impact downstream services if data quality issues are not addressed. Typically,
such problems can be investigated and fixed by on-call engineers, but locating
the cause of such problems and fixing errors are often time-consuming.
Therefore, automatic data validation is a better solution to defend the system
and downstream services by enabling early detection of errors and providing
detailed error messages for quick resolution. This paper proposes a
self-validating data management system with automatic pattern discovery
techniques to verify the correctness of semi-structured string data in
enterprise data pipelines. Our solution extracts patterns from historical data
and detects erroneous incoming data in a top-down fashion. High-level
information about historical data is analyzed to discover the format skeleton of
correct values. Fine-grained semantic patterns are then extracted to strike a
balance between generalization and specificity of the discovered pattern,
thus covering as many correct values as possible while avoiding over-fitting.
To tackle cold start and rapid data growth, we propose an incremental update
strategy and an example generalization strategy. Experiments on large-scale
industrial and public datasets demonstrate the effectiveness and efficiency of
our method compared to alternative solutions. Furthermore, a case study on an
industrial platform (Ant Group Inc.) with thousands of applications shows that
our system captures meaningful data patterns in daily operations and helps
engineers quickly identify errors.
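
The top-down idea described in the abstract (first a coarse format skeleton, then finer per-position constraints) can be illustrated with a small Python sketch. This is not the authors' algorithm: the tokenizer, the `D`/`L` token classes, the `max_constants` threshold, and all function names are assumptions chosen for illustration only.

```python
import re
from collections import defaultdict

# Hypothetical tokenizer: runs of ASCII letters, runs of ASCII digits,
# or a single character of anything else (kept literally in the skeleton).
TOKEN = re.compile(r"[A-Za-z]+|[0-9]+|[^A-Za-z0-9]")

def tokenize(value):
    return TOKEN.findall(value)

def token_class(tok):
    c = tok[0]
    if "0" <= c <= "9":
        return "D"          # digit run
    if c.isascii() and c.isalpha():
        return "L"          # letter run
    return tok              # symbol, kept as-is

def skeleton(value):
    """Coarse, high-level format of a value, e.g. ('L', '-', 'D')."""
    return tuple(token_class(t) for t in tokenize(value))

def discover_patterns(history, max_constants=2):
    """Group historical values by skeleton, then refine each position.

    A position with few distinct tokens is kept as a set of constants
    (more specific); otherwise it is generalized to its token class plus
    the observed length range (more general). The threshold is assumed.
    """
    groups = defaultdict(list)
    for v in history:
        groups[skeleton(v)].append(tokenize(v))
    patterns = {}
    for skel, rows in groups.items():
        slots = []
        for i, cls in enumerate(skel):
            seen = {row[i] for row in rows}
            if len(seen) <= max_constants:
                slots.append(("const", frozenset(seen)))
            else:
                lengths = [len(t) for t in seen]
                slots.append(("class", cls, min(lengths), max(lengths)))
        patterns[skel] = slots
    return patterns

def validate(value, patterns):
    """Return (is_valid, message) for an incoming value."""
    toks = tokenize(value)
    slots = patterns.get(skeleton(value))
    if slots is None:
        return False, "unknown format skeleton"
    for i, (tok, slot) in enumerate(zip(toks, slots)):
        if slot[0] == "const":
            if tok not in slot[1]:
                return False, f"token {i} ({tok!r}) is not an expected constant"
        else:
            _, cls, lo, hi = slot
            if not lo <= len(tok) <= hi:
                return False, f"token {i} ({tok!r}) has length outside {lo}..{hi}"
    return True, "ok"

# Made-up historical values, for illustration only.
history = ["ORD-2024-0001", "ORD-2024-0002", "ORD-2023-0417",
           "ORD-2024-1234", "ORD-2022-9999"]
patterns = discover_patterns(history)
print(validate("ORD-2025-0042", patterns))
print(validate("ORD/2025/0042", patterns))
```

The message returned with each rejection stands in for the detailed error messages the abstract mentions. A real system would also need the incremental update and cold-start handling that this sketch omits.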