To UDFs and Beyond: Demonstration of a Fully Decomposed Data Processor for General Data Wrangling Tasks

IF 3.3 | Zone 3, Computer Science | Q2 Computer Science, Information Systems

Proceedings of the VLDB Endowment | Pub Date: 2023-08-01 | DOI: 10.14778/3611540.3611610

Nico Schäfer, Damjan Gjurovski, Angjela Davitkova, Sebastian Michel


Abstract

While existing data management solutions try to keep up with novel data formats and features, a myriad of valuable functionality is often accessible only via programming language libraries. Particularly for machine learning tasks, there is a wealth of pre-trained models and easy-to-use libraries that allow a wide audience to harness state-of-the-art machine learning. We propose the demonstration of a highly modularized data processor for semi-structured data that can be extended by means of plain Python scripts. Beyond commonly supported user-defined functions, the deep decomposition allows augmenting the core engine with additional index structures, customized import and export routines, and custom aggregation functions. For several use cases, we detail how user-defined modules can be quickly realized. We invite the audience to write and apply custom code, and to tailor the code snippets we provide to their own preferences, in order to solve data analytics tasks involving sentiment analysis of Twitter tweets.
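To make the idea of extending an engine with plain Python scripts more concrete, the following is a minimal sketch of how a scalar user-defined function and a custom aggregation function might be registered and applied to semi-structured tweet records. Everything in it is an assumption made for illustration: the registries `SCALAR_UDFS` and `AGGREGATES`, the decorators `scalar_udf` and `aggregate`, the `run` helper, and the toy word lists are hypothetical and do not reflect the actual extension API of the demonstrated system, which the abstract does not describe. A real module would likely call a pre-trained sentiment model instead of a lexicon.

```python
from typing import Any, Callable, Dict, Iterable, List

# Hypothetical plugin registries (not the demonstrated system's API).
SCALAR_UDFS: Dict[str, Callable[[Any], Any]] = {}
AGGREGATES: Dict[str, Callable[[Iterable[Any]], Any]] = {}


def scalar_udf(name: str):
    """Register a per-record function under the given name."""
    def register(fn):
        SCALAR_UDFS[name] = fn
        return fn
    return register


def aggregate(name: str):
    """Register a function that folds many values into one."""
    def register(fn):
        AGGREGATES[name] = fn
        return fn
    return register


# Toy lexicon standing in for a pre-trained model (assumption).
POSITIVE = {"great", "love", "good"}
NEGATIVE = {"bad", "hate", "awful"}


@scalar_udf("sentiment")
def sentiment(text: str) -> int:
    """Count positive words minus negative words in the text."""
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


@aggregate("mean")
def mean(values: Iterable[float]) -> float:
    """Arithmetic mean; returns 0.0 for an empty input."""
    values = list(values)
    return sum(values) / len(values) if values else 0.0


def run(records: List[dict], field: str, udf: str, agg: str) -> Any:
    """Apply a registered UDF to one field, then aggregate the results.

    Records lacking the field are skipped, as is common for
    semi-structured data.
    """
    scores = (SCALAR_UDFS[udf](r[field]) for r in records if field in r)
    return AGGREGATES[agg](scores)


tweets = [
    {"id": 1, "text": "I love this great demo"},
    {"id": 2, "text": "awful latency today"},
    {"id": 3, "lang": "en"},  # no "text" field; skipped by run()
]
result = run(tweets, "text", "sentiment", "mean")
print(result)
```

Looking functions up by name in a registry keeps the core loop in `run` unaware of which functions exist, which is one simple way to obtain the kind of decoupling the abstract attributes to a decomposed engine.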


Source journal

Proceedings of the VLDB Endowment | Computer Science: General Computer Science

CiteScore

7.70

Self-citation rate

0.00%


Journal description: The Proceedings of the VLDB (PVLDB) welcomes original research papers on a broad range of research topics related to all aspects of data management, where systems issues play a significant role, such as data management system technology and information management infrastructures, including their very large scale of experimentation, novel architectures, and demanding applications as well as their underpinning theory. The scope of a submission for PVLDB is also described by the subject areas given below. Moreover, the scope of PVLDB is restricted to scientific areas that are covered by the combined expertise on the submission’s topic of the journal’s editorial board. Finally, the submission’s contributions should build on work already published in data management outlets, e.g., PVLDB, VLDBJ, ACM SIGMOD, IEEE ICDE, EDBT, ACM TODS, IEEE TKDE, and go beyond a syntactic citation.