Supporting Better Insights of Data Science Pipelines with Fine-grained Provenance

IF 2.2 · Region 2 (Computer Science) · JCR Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS

ACM Transactions on Database Systems Pub Date : 2024-02-09 DOI:10.1145/3644385

Adriane Chapman, Luca Lauro, Paolo Missier, Riccardo Torlone


Citations: 0

Abstract

Successful data-driven science requires complex data engineering pipelines to clean, transform, and alter data in preparation for machine learning. Robust results can only be achieved when each step in the pipeline can be justified and its effect on the data explained. Within this setting, we aim to provide data scientists with facilities to gain an in-depth understanding of how each step in the pipeline affects the data, from the raw input to training sets ready to be used for learning. Starting from an extensible set of data preparation operators commonly used in data science, we present a provenance management infrastructure for generating, storing, and querying very granular accounts of data transformations, at the level of individual elements within datasets whenever possible. Then, from the formal definition of a core set of data science preprocessing operators, we derive a provenance semantics embodied by a collection of templates expressed in PROV, a standard model for data provenance. Using those templates as a reference, our provenance generation algorithm generalises to any operator with observable input/output pairs. We provide a prototype implementation of an application-level provenance capture library that produces, in a semi-automatic way, complete provenance documents that account for the entire pipeline. We report on the ability of that reference implementation to capture provenance in real ML benchmark pipelines and over TPC-DI synthetic data. Finally, we show how the collected provenance can be used to answer a suite of provenance benchmark queries that underpin some common pipeline inspection questions, as expressed on the Data Science Stack Exchange.
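The idea of deriving element-level provenance from observable input/output pairs can be illustrated with a small sketch. This is not the authors' library or their PROV templates: the `ProvDocument` class, the `capture` and `produced_by` functions, the cell identifier scheme, and the `impute_age` operator are all hypothetical names chosen for illustration. The sketch assumes every row carries a stable identifier so that input and output elements can be aligned, and it covers a single pipeline step only. The relation names mirror the PROV relations `used`, `wasGeneratedBy`, `wasDerivedFrom`, and `wasInvalidatedBy`.

```python
from dataclasses import dataclass, field

# Assumption: a dataset is a dict {row_id: {column: value}} with stable row ids.


@dataclass
class ProvDocument:
    """Holds PROV-style relations as plain tuples (hypothetical structure)."""
    used: list = field(default_factory=list)                # (activity, entity)
    was_generated_by: list = field(default_factory=list)    # (entity, activity)
    was_derived_from: list = field(default_factory=list)    # (new, old)
    was_invalidated_by: list = field(default_factory=list)  # (entity, activity)


def entity_id(row_id, column, version):
    """Identifier for one dataset element (cell) at a given version."""
    return f"cell:{row_id}/{column}@v{version}"


def capture(doc, activity, before, after, version):
    """Record element-level provenance by diffing an operator's input and output."""
    for row_id, row in before.items():
        for column, old_value in row.items():
            old = entity_id(row_id, column, version)
            if row_id not in after or column not in after[row_id]:
                # The element was removed by the operator.
                doc.used.append((activity, old))
                doc.was_invalidated_by.append((old, activity))
            elif after[row_id][column] != old_value:
                # The element changed: a new version is derived from the old one.
                new = entity_id(row_id, column, version + 1)
                doc.used.append((activity, old))
                doc.was_generated_by.append((new, activity))
                doc.was_derived_from.append((new, old))
    for row_id, row in after.items():
        for column in row:
            if row_id not in before or column not in before[row_id]:
                # The element is new, e.g. a column added by the operator.
                new = entity_id(row_id, column, version + 1)
                doc.was_generated_by.append((new, activity))
    return doc


def produced_by(doc, entity):
    """Simple provenance query: which activities generated this element?"""
    return [a for e, a in doc.was_generated_by if e == entity]


def impute_age(data, default):
    """Example operator: fill missing 'age' values and drop rows without a name."""
    return {
        rid: {**row, "age": default if row["age"] is None else row["age"]}
        for rid, row in data.items()
        if row["name"] is not None
    }


raw = {
    "r1": {"name": "Ada", "age": 36},
    "r2": {"name": "Bob", "age": None},
    "r3": {"name": None, "age": 41},
}
clean = impute_age(raw, default=30)
doc = capture(ProvDocument(), "impute_age", raw, clean, version=0)
```

In this run, the imputed `age` of row `r2` is recorded as a new element derived from the old one, and both elements of the dropped row `r3` are recorded as invalidated. Because `capture` only compares inputs and outputs, it needs no knowledge of what the operator does internally, which is the property the abstract attributes to operators with observable input/output pairs.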


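Source journal

ACM Transactions on Database Systems (Engineering and Technology, Computer Science: Software Engineering)

CiteScore: 5.60

Self-citation rate: 0.00%

Articles published: (no value given)

Review time: >12 weeks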

About the journal: Heavily used in both academic and corporate R&D settings, ACM Transactions on Database Systems (TODS) is a key publication for computer scientists working in data abstraction, data modeling, and designing data management systems. Topics include storage and retrieval, transaction management, distributed and federated databases, semantics of data, intelligent databases, and operations and algorithms relating to these areas. In this rapidly changing field, TODS provides insights into the thoughts of the best minds in database R&D.