DataRinse: Semantic Transforms for Data Preparation Based on Code Mining

IF 3.3, Zone 3 Computer Science, Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS

Proceedings of the VLDB Endowment Pub Date: 2023-08-01 DOI: 10.14778/3611540.3611628

Ibrahim Abdelaziz, Julian Dolby, Udayan Khurana, Horst Samulowitz, Kavitha Srinivas

Citations: 0

Abstract

Data preparation is a crucial first step in any data analysis problem. The task is largely manual and is typically performed by a person familiar with the data domain. DataRinse is a system designed to extract relevant transforms through large-scale static analysis of code repositories. Our motivation is that in any large enterprise, multiple personas such as data engineers and data scientists work on similar datasets. However, sharing or reusing the code they write is neither straightforward nor easy to carry out. In this paper, we demonstrate how DataRinse handles data preparation: the system recommends code designed to help prepare a column for data analysis more generally. We show that DataRinse does not simply shard expressions observed in code but also uses analysis to group expressions applied to the same field, so that related transforms appear coherently to a user. It is a human-in-the-loop system in which users select relevant code snippets produced by DataRinse to apply to their dataset.
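
The idea of grouping mined expressions by the field they touch can be illustrated with a minimal sketch. This is not DataRinse's implementation: it assumes pandas-style code in which a column is accessed as `frame["column"]`, uses Python's standard `ast` module (Python 3.9 or later) in place of the large-scale static analysis the abstract describes, and the names `mine_column_transforms`, `columns_read`, and `repo_snippets` are hypothetical.

```python
import ast
from collections import defaultdict


def columns_read(expr):
    """Return the string keys used in subscripts such as df["age"] inside expr."""
    cols = set()
    for node in ast.walk(expr):
        if (isinstance(node, ast.Subscript)
                and isinstance(node.slice, ast.Constant)
                and isinstance(node.slice.value, str)):
            cols.add(node.slice.value)
    return cols


def mine_column_transforms(sources):
    """Group right-hand-side expressions of assignments by the column they read.

    Hypothetical helper: a toy stand-in for mining transforms from code.
    """
    groups = defaultdict(list)
    for src in sources:
        tree = ast.parse(src)
        for node in ast.walk(tree):
            if not isinstance(node, ast.Assign):
                continue
            # Skip plain reads such as x = df["age"]; they transform nothing.
            if isinstance(node.value, ast.Subscript):
                continue
            snippet = ast.unparse(node.value)
            for col in sorted(columns_read(node.value)):
                if snippet not in groups[col]:
                    groups[col].append(snippet)
    return dict(groups)


# Assumed example input: snippets as they might appear in different repositories.
repo_snippets = [
    'df["age"] = df["age"].fillna(0)',
    'data["age"] = data["age"].astype(int)',
    'df["name"] = df["name"].str.strip().str.lower()',
]

groups = mine_column_transforms(repo_snippets)
for column, snippets in groups.items():
    print(column, snippets)
```

In this sketch the two `age` expressions end up in one group even though they come from different snippets and use different variable names, which is the kind of coherent, per-field presentation a user could then pick from.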


Source journal: Proceedings of the VLDB Endowment (Computer Science: General Computer Science)

CiteScore: 7.70

Self-citation rate: 0.00%

Journal description: The Proceedings of the VLDB (PVLDB) welcomes original research papers on a broad range of research topics related to all aspects of data management, where systems issues play a significant role, such as data management system technology and information management infrastructures, including their very large scale of experimentation, novel architectures, and demanding applications as well as their underpinning theory. The scope of a submission for PVLDB is also described by the subject areas given below. Moreover, the scope of PVLDB is restricted to scientific areas that are covered by the combined expertise on the submission’s topic of the journal’s editorial board. Finally, the submission’s contributions should build on work already published in data management outlets, e.g., PVLDB, VLDBJ, ACM SIGMOD, IEEE ICDE, EDBT, ACM TODS, IEEE TKDE, and go beyond a syntactic citation.