Pollock: A Data Loading Benchmark

Proc. VLDB Endow. Pub Date: 2023-04-01. DOI: 10.14778/3594512.3594518

Gerardo Vitagliano, Mazhar Hameed, Lan Jiang, Lucas Reisener, Eugene Wu, Felix Naumann

Citations: 2

Abstract

Any system at play in a data-driven project has a fundamental requirement: the ability to load data. The de facto standard format to distribute and consume raw data is CSV. Yet the plain-text, flexible nature of this format often makes such files difficult to parse and their content difficult to load correctly, requiring cumbersome data preparation steps. We propose a benchmark to assess the robustness of systems in loading data from CSV files that use non-standard formats or contain structural inconsistencies. First, we formalize a model to describe the issues that affect real-world files and use it to derive a systematic "pollution" process to generate dialects for any given grammar. Our benchmark applies this pollution framework to the CSV format. To guide pollution, we surveyed thousands of real-world, publicly available CSV files, recording the problems we encountered. We demonstrate the applicability of our benchmark by testing and scoring 16 different systems: popular CSV parsing frameworks, relational database tools, spreadsheet systems, and a data visualization tool.
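The pollution idea can be illustrated with a small sketch. It is not the benchmark's implementation: the sample table, the helper names (`write_dialect`, `load_with_defaults`, `cell_accuracy`), and the cell-level score are all assumptions made for illustration, and the system under test is simply Python's standard `csv` reader with default settings. The sketch serializes one clean table under a standard and a non-standard dialect, loads both, and compares what was recovered.

```python
import csv
import io

# Hypothetical clean table, used only for illustration.
CLEAN_ROWS = [
    ["id", "name", "note"],
    ["1", "Ada", "likes; semicolons"],
    ["2", "Bob", 'says "hi"'],
]


def write_dialect(rows, delimiter=",", quotechar='"'):
    """Serialize rows under one delimiter/quote-character combination."""
    buf = io.StringIO()
    writer = csv.writer(
        buf,
        delimiter=delimiter,
        quotechar=quotechar,
        quoting=csv.QUOTE_MINIMAL,
        lineterminator="\n",
    )
    writer.writerows(rows)
    return buf.getvalue()


def load_with_defaults(text):
    """Load text with the standard-library reader and its default dialect."""
    return list(csv.reader(io.StringIO(text)))


def cell_accuracy(expected, actual):
    """Fraction of expected cells found at the same position in actual.

    An assumed, simplified score; not the scoring used by the benchmark.
    """
    total = sum(len(row) for row in expected)
    hits = 0
    for i, row in enumerate(expected):
        for j, cell in enumerate(row):
            if i < len(actual) and j < len(actual[i]) and actual[i][j] == cell:
                hits += 1
    return hits / total


# Same content, two dialects: comma/double-quote versus semicolon/single-quote.
standard = write_dialect(CLEAN_ROWS)
polluted = write_dialect(CLEAN_ROWS, delimiter=";", quotechar="'")

score_standard = cell_accuracy(CLEAN_ROWS, load_with_defaults(standard))
score_polluted = cell_accuracy(CLEAN_ROWS, load_with_defaults(polluted))
print(score_standard, score_polluted)
```

Under these assumptions, the default reader recovers the standard file fully but loses cells on the semicolon-delimited variant, which is the kind of robustness gap a loading benchmark is meant to expose.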


