Vertical partitioning for query processing over raw data

Proceedings of the 27th International Conference on Scientific and Statistical Database Management. Pub Date: 2015-06-29. DOI: 10.1145/2791347.2791369

Weijie Zhao, Yu Cheng, Florin Rusu


Citations: 19

Abstract

Traditional databases are not equipped with adequate functionality to handle the volume and variety of "Big Data". Strict schema definition and data loading are prerequisites even for the most primitive query session. Raw data processing has been proposed as a schema-on-demand alternative that provides instant access to the data. When loading is an option, it is driven exclusively by the currently running query, resulting in sub-optimal performance across a query workload. In this paper, we investigate the problem of workload-driven raw data processing with partial loading. We model loading as fully-replicated binary vertical partitioning. We provide a linear mixed integer programming optimization formulation that we prove to be NP-hard. We design a two-stage heuristic that comes within close range of the optimal solution in a fraction of the time. We extend the optimization formulation and the heuristic to pipelined raw data processing, a scenario in which data access and extraction are executed concurrently. We provide three case studies over real data formats that confirm the accuracy of the model when implemented in a state-of-the-art pipelined operator for raw data processing.
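To make the idea of workload-driven partial loading concrete, the following is a minimal sketch under loudly stated assumptions. It is not the paper's mixed integer programming formulation or its two-stage heuristic. The cost constants (`RAW_SCAN`, `EXTRACT`, `READ_LOADED`), the cost model, and the function names are all hypothetical, chosen only for illustration. The sketch treats the raw file as a partition that always holds every attribute and searches exhaustively for the subset of attributes to load, within a storage budget, that minimizes the weighted cost of a whole query workload.

```python
from itertools import combinations

# Hypothetical cost constants (assumptions, not values from the paper).
RAW_SCAN = 10.0     # cost of reading the raw file once
EXTRACT = 2.0       # cost of parsing one attribute out of the raw file
READ_LOADED = 1.0   # cost of reading one attribute that was already loaded


def query_cost(query_attrs, loaded):
    """Toy cost of one query given the set of loaded attributes."""
    missing = query_attrs - loaded
    # Attributes present in the loaded partition are read from there.
    cost = READ_LOADED * len(query_attrs & loaded)
    # Any missing attribute forces a raw scan plus per-attribute extraction.
    if missing:
        cost += RAW_SCAN + EXTRACT * len(missing)
    return cost


def workload_cost(workload, loaded):
    """Weighted sum of query costs over the whole workload."""
    return sum(weight * query_cost(attrs, loaded) for attrs, weight in workload)


def best_loading(num_attrs, workload, budget):
    """Exhaustively try every attribute subset with at most `budget` members."""
    best_set = frozenset()
    best_cost = workload_cost(workload, best_set)
    for k in range(1, budget + 1):
        for combo in combinations(range(num_attrs), k):
            candidate = frozenset(combo)
            cost = workload_cost(workload, candidate)
            if cost < best_cost:
                best_set, best_cost = candidate, cost
    return best_set, best_cost


# Each query is (set of attribute indices it accesses, weight in the workload).
workload = [
    (frozenset({0, 1}), 3.0),
    (frozenset({1, 2}), 1.0),
    (frozenset({3}), 1.0),
]
loaded, cost = best_loading(num_attrs=4, workload=workload, budget=2)
print(sorted(loaded), cost)  # prints [0, 1] 31.0
```

With these toy numbers, loading attributes 0 and 1 wins because the heaviest query then avoids the raw file entirely. Exhaustive search grows exponentially with the number of attributes, which is in line with the abstract's motivation for a heuristic once the problem is shown to be NP-hard.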

