Enabling useful provenance in scripting languages with a human-in-the-loop

Proceedings of the 2nd Workshop on Human-In-the-Loop Data Analytics. Workshop on Human-In-the-Loop Data Analytics (2nd : 2017 : Chicago, Ill.). Pub Date: 2022-06-12. DOI: 10.1145/3546930.3547494

Yuze Lou, Michael J. Cafarella


Abstract

Most data scientists must build substantial data pipelines using scripting languages like Python and R. These pipelines are hard to get right because they process large volumes of data (and therefore take a long time to run) and because they are tested mainly by inspecting the quality of their output data. It is therefore crucial for developers to reason about data through each step in the pipeline, starting from the raw input; this information is akin to data provenance in a relational setting. Past efforts to capture data provenance for scripting languages have either required substantial manual modifications to the scripts or yielded information that is too inflexible for many debugging tasks. We instead propose a "human-in-the-loop" provenance generation model with three key improvements: (1) allowing humans to express the desired provenance through a provenance schema, (2) enabling one-time execution capture of scripts to produce traces that are later combined with different provenance schemata to yield useful provenance for different tasks, and (3) providing a modular rule-based recommendation component that helps users design provenance schemata through an interactive interface. We describe the concepts and the user experience with our system, explain the system components, and present preliminary experimental results.
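The idea of capturing a trace once and then combining it with different provenance schemata can be illustrated with a toy sketch. This is not the authors' system: the abstract does not specify how traces or schemata are represented, so everything here (the `TraceEvent` record, modeling a schema as a predicate over events, and the `run_pipeline_with_trace` and `lineage` helpers) is a hypothetical simplification chosen only to show the shape of the workflow.

```python
from dataclasses import dataclass
from typing import Callable, List, Set, Tuple


@dataclass(frozen=True)
class TraceEvent:
    """Hypothetical trace record: one derived value and what it was derived from."""
    step: str                 # name of the pipeline step that ran
    inputs: Tuple[str, ...]   # identifiers of the values that were read
    output: str               # identifier of the value that was produced


def run_pipeline_with_trace(rows: List[int]) -> Tuple[List[int], List[TraceEvent]]:
    """Toy pipeline (drop negatives, then double) that records a trace while it runs."""
    trace: List[TraceEvent] = []
    kept: List[int] = []
    for i, value in enumerate(rows):
        if value >= 0:
            trace.append(TraceEvent("filter", (f"raw[{i}]",), f"kept[{len(kept)}]"))
            kept.append(value)
    out: List[int] = []
    for j, value in enumerate(kept):
        trace.append(TraceEvent("double", (f"kept[{j}]",), f"out[{j}]"))
        out.append(value * 2)
    return out, trace


# Assumption: a "schema" is modeled as a predicate that selects the events a user cares about.
Schema = Callable[[TraceEvent], bool]


def lineage(trace: List[TraceEvent], schema: Schema, target: str) -> Set[str]:
    """Walk backwards from `target` through the events the schema accepts."""
    producers = {event.output: event for event in trace if schema(event)}
    frontier, sources = [target], set()
    while frontier:
        value_id = frontier.pop()
        event = producers.get(value_id)
        if event is None:
            sources.add(value_id)      # no accepted producer: treat as a source
        else:
            frontier.extend(event.inputs)
    return sources


# The pipeline executes once; the same trace is then queried under two schemata.
out, trace = run_pipeline_with_trace([3, -1, 5])
full = lineage(trace, lambda e: True, "out[1]")
last_step_only = lineage(trace, lambda e: e.step == "double", "out[1]")
```

In this sketch the schema that accepts every event traces `out[1]` back to the raw input row, while the schema restricted to the `double` step stops at the intermediate value, without re-running the pipeline in either case.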
