Enabling useful provenance in scripting languages with a human-in-the-loop

Yuze Lou, Michael J. Cafarella
Published in: Proceedings of the 2nd Workshop on Human-In-the-Loop Data Analytics. Workshop on Human-In-the-Loop Data Analytics (2nd : 2017 : Chicago, Ill.)
Publication date: 2022-06-12
DOI: https://doi.org/10.1145/3546930.3547494

Abstract

Most data scientists must build substantial data pipelines using scripting languages like Python and R. These pipelines are hard to get right because they process large volumes of data (and therefore take a long time to run) and because they are tested mainly by inspecting the quality of the output data. It is therefore crucial for developers to reason about data at each step of the pipeline, starting from the raw input; this information is akin to data provenance in a relational setting. Past efforts to capture data provenance for scripting languages have either required substantial manual modifications to the scripts or yielded information that is too inflexible for many debugging tasks. We instead propose a "human-in-the-loop" provenance generation model with three key improvements: (1) allowing humans to express the desired provenance through a provenance schema; (2) enabling one-time execution capture of scripts to produce traces that are later combined with different provenance schemata to yield useful provenance for different tasks; and (3) providing a modular rule-based recommendation component that helps users design provenance schemata through an interactive interface. We describe the concepts and the user experience of our system, explain its components, and present preliminary experimental results.
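
To make the idea of "capture once, combine with different schemata" concrete, here is a minimal Python sketch. It is not the authors' system: the abstract does not specify how traces or schemata are represented, so `TraceEvent`, `capture_trace`, `provenance`, and the toy pipeline steps are all hypothetical names chosen for illustration. The sketch assumes a trace is a list of recorded steps and a schema is simply the set of step names the user wants to see.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TraceEvent:
    """One recorded pipeline step (hypothetical trace format)."""
    step: str      # name of the step that ran
    inputs: tuple  # ids of the data items the step read
    output: str    # id of the data item the step produced


def capture_trace():
    """Stand-in for one-time execution capture: returns a fixed toy trace."""
    return [
        TraceEvent("load", ("raw.csv",), "r1"),
        TraceEvent("clean", ("r1",), "r2"),
        TraceEvent("join", ("r2", "lookup.csv"), "r3"),
        TraceEvent("aggregate", ("r3",), "out"),
    ]


def provenance(trace, schema, target):
    """Walk backwards from `target`, reporting only steps named in `schema`.

    `schema` is assumed to be a set of step names; a real provenance
    schema would likely be richer than this.
    """
    producers = {event.output: event for event in trace}
    result, stack, seen = [], [target], set()
    while stack:
        item = stack.pop()
        if item in seen or item not in producers:
            continue  # already visited, or a raw input with no producer
        seen.add(item)
        event = producers[item]
        if event.step in schema:
            result.append((event.step, event.inputs, event.output))
        stack.extend(event.inputs)
    return result


# The trace is captured once, then viewed through two different schemata.
trace = capture_trace()
join_view = provenance(trace, {"join"}, "out")
cleaning_view = provenance(trace, {"load", "clean"}, "out")
```

The same `trace` object serves both queries, so the pipeline itself never has to be re-run when the debugging question changes; only the schema passed to `provenance` differs.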