Finding Related Tables in Data Lakes for Interactive Data Science.

Proceedings. ACM-SIGMOD International Conference on Management of Data. Pub Date: 2020-06-01. DOI: 10.1145/3318464.3389726

Yi Zhang, Zachary G Ives

Citations: 75

Abstract

Many modern data science applications build on data lakes, schema-agnostic repositories of data files and data products that offer limited organization and management capabilities. There is a need to build data lake search capabilities into data science environments, so scientists and analysts can find tables, schemas, workflows, and datasets useful to their task at hand. We develop search and management solutions for the Jupyter Notebook data science platform, to enable scientists to augment training data, find potential features to extract, clean data, and find joinable or linkable tables. Our core methods also generalize to other settings where computational tasks involve execution of programs or scripts.
