Data Sentinel: A Declarative Production-Scale Data Validation Platform

A. Swami, Sriram Vasudevan, Joojay Huyn
{"title":"Data Sentinel: A Declarative Production-Scale Data Validation Platform","authors":"A. Swami, Sriram Vasudevan, Joojay Huyn","doi":"10.1109/ICDE48307.2020.00140","DOIUrl":null,"url":null,"abstract":"Many organizations process big data for important business operations and decisions. Hence, data quality greatly affects their success. Data quality problems continue to be widespread, costing US businesses an estimated $600 billion annually. To date, addressing data quality in production environments still poses many challenges: easily defining properties of high-quality data; validating production-scale data in a timely manner; debugging poor quality data; designing data quality solutions to be easy to use, understand, and operate; and designing data quality solutions to easily integrate with other systems. Current data validation solutions do not comprehensively address these challenges. To address data quality in production environments at LinkedIn, we developed Data Sentinel, a declarative production-scale data validation platform. In a simple and well-structured configuration, users declaratively specify the desired data checks. Then, Data Sentinel performs these data checks and writes the results to an easily understandable report. Furthermore, Data Sentinel provides well-defined schemas for the configuration and report. This makes it easy for other systems to interface or integrate with Data Sentinel. To make Data Sentinel even easier to use, understand, and operate in production environments, we provide Data Sentinel Service (DSS), a complementary system to help specify data checks, schedule, deploy, and tune data validation jobs, and understand data checking results. The contributions of this paper include the following: 1) Data Sentinel, a declarative production-scale data validation platform successfully deployed at LinkedIn 2) A generic design to build and deploy similar systems for production environments 3) Experiences and lessons learned that can benefit practitioners with similar objectives.","PeriodicalId":6709,"journal":{"name":"2020 IEEE 36th International Conference on Data Engineering (ICDE)","volume":"13 1","pages":"1579-1590"},"PeriodicalIF":0.0000,"publicationDate":"2020-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"11","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2020 IEEE 36th International Conference on Data Engineering (ICDE)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICDE48307.2020.00140","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 11

Abstract

Many organizations process big data for important business operations and decisions. Hence, data quality greatly affects their success. Data quality problems continue to be widespread, costing US businesses an estimated $600 billion annually. To date, addressing data quality in production environments still poses many challenges: easily defining properties of high-quality data; validating production-scale data in a timely manner; debugging poor quality data; designing data quality solutions to be easy to use, understand, and operate; and designing data quality solutions to easily integrate with other systems. Current data validation solutions do not comprehensively address these challenges. To address data quality in production environments at LinkedIn, we developed Data Sentinel, a declarative production-scale data validation platform. In a simple and well-structured configuration, users declaratively specify the desired data checks. Then, Data Sentinel performs these data checks and writes the results to an easily understandable report. Furthermore, Data Sentinel provides well-defined schemas for the configuration and report. This makes it easy for other systems to interface or integrate with Data Sentinel. To make Data Sentinel even easier to use, understand, and operate in production environments, we provide Data Sentinel Service (DSS), a complementary system to help specify data checks, schedule, deploy, and tune data validation jobs, and understand data checking results. The contributions of this paper include the following: 1) Data Sentinel, a declarative production-scale data validation platform successfully deployed at LinkedIn 2) A generic design to build and deploy similar systems for production environments 3) Experiences and lessons learned that can benefit practitioners with similar objectives.
数据哨兵:声明式生产规模数据验证平台
许多组织处理大数据来进行重要的业务操作和决策。因此,数据质量极大地影响了它们的成功。数据质量问题仍然普遍存在,估计每年给美国企业造成6000亿美元的损失。迄今为止,在生产环境中解决数据质量问题仍然面临许多挑战:容易定义高质量数据的属性;及时验证生产规模数据;调试质量差的数据;设计易于使用、理解和操作的数据质量解决方案;并设计数据质量解决方案,以便与其他系统轻松集成。当前的数据验证解决方案不能全面解决这些挑战。为了解决LinkedIn生产环境中的数据质量问题,我们开发了data Sentinel,这是一个声明式的生产规模数据验证平台。在简单且结构良好的配置中,用户声明式地指定所需的数据检查。然后,Data Sentinel执行这些数据检查,并将结果写入一个易于理解的报告。此外,Data Sentinel为配置和报告提供了良好定义的模式。这使得其他系统很容易与Data Sentinel进行接口或集成。为了使Data Sentinel在生产环境中更容易使用、理解和操作,我们提供了Data Sentinel Service (DSS),这是一个辅助系统,可帮助指定数据检查、调度、部署和调优数据验证作业,并理解数据检查结果。本文的贡献包括以下内容:1)Data Sentinel,一个成功部署在LinkedIn上的声明式生产规模数据验证平台;2)为生产环境构建和部署类似系统的通用设计;3)经验和教训,可以使具有类似目标的从业者受益。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信