A robust partitioning scheme for ad-hoc query workloads

Proceedings of the 2017 Symposium on Cloud Computing | Pub Date: 2017-09-24 | DOI: 10.1145/3127479.3131613

Anil Shanbhag, Alekh Jindal, S. Madden, Jorge-Arnulfo Quiané-Ruiz, Aaron J. Elmore

{"title":"A robust partitioning scheme for ad-hoc query workloads","authors":"Anil Shanbhag, Alekh Jindal, S. Madden, Jorge-Arnulfo Quiané-Ruiz, Aaron J. Elmore","doi":"10.1145/3127479.3131613","DOIUrl":null,"url":null,"abstract":"Data partitioning is crucial to improving query performance several workload-based partitioning techniques have been proposed in database literature. However, many modern analytic applications involve ad-hoc or exploratory analysis where users do not have a representative query workload a priori. Static workload-based data partitioning techniques are therefore not suitable for such settings. In this paper, we propose Amoeba, a distributed storage system that uses adaptive multi-attribute data partitioning to efficiently support ad-hoc as well as recurring queries. Amoeba requires zero set-up and tuning effort, allowing analysts to get the benefits of partitioning without requiring an upfront query workload. The key idea is to build and maintain a partitioning tree on top of the dataset. The partitioning tree allows us to answer queries with predicates by reading a subset of the data. The initial partitioning tree is created without requiring an upfront query workload and Amoeba adapts it over time by incrementally modifying subtrees based on user queries using repartitioning. 
A prototype of Amoeba running on top of Apache Spark improves query performance by up to 7x over full scans and up to 2x over range-based partitioning techniques on TPC-H as well as a real-world workload.","PeriodicalId":20679,"journal":{"name":"Proceedings of the 2017 Symposium on Cloud Computing","volume":"31 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2017-09-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"36","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2017 Symposium on Cloud Computing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3127479.3131613","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Citations: 36

Abstract

Data partitioning is crucial to improving query performance, and several workload-based partitioning techniques have been proposed in the database literature. However, many modern analytic applications involve ad-hoc or exploratory analysis where users do not have a representative query workload a priori. Static workload-based data partitioning techniques are therefore not suitable for such settings. In this paper, we propose Amoeba, a distributed storage system that uses adaptive multi-attribute data partitioning to efficiently support ad-hoc as well as recurring queries. Amoeba requires zero set-up and tuning effort, allowing analysts to get the benefits of partitioning without requiring an upfront query workload. The key idea is to build and maintain a partitioning tree on top of the dataset. The partitioning tree allows us to answer queries with predicates by reading a subset of the data. The initial partitioning tree is created without requiring an upfront query workload, and Amoeba adapts it over time by incrementally modifying subtrees based on user queries using repartitioning. A prototype of Amoeba running on top of Apache Spark improves query performance by up to 7x over full scans and up to 2x over range-based partitioning techniques on TPC-H as well as a real-world workload.
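To make the partitioning-tree idea concrete, here is a minimal Python sketch of a workload-agnostic tree over multiple attributes, and of how a query with range predicates can skip blocks. It is an illustration under stated assumptions, not Amoeba's implementation: the names `Node`, `build`, `scan`, and `count_blocks`, the round-robin choice of split attribute, and the median cut points are all hypothetical choices made for this example. The adaptive repartitioning step described in the abstract is not shown.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

Row = Dict[str, float]
# Hypothetical predicate format: attribute -> inclusive (low, high) range.
Predicate = Dict[str, Tuple[float, float]]


@dataclass
class Node:
    attr: Optional[str] = None         # split attribute (None on leaves)
    cut: Optional[float] = None        # rows with row[attr] <= cut go left
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    rows: Optional[List[Row]] = None   # data block, set only on leaves


def build(rows: List[Row], attrs: List[str], block_size: int, depth: int = 0) -> Node:
    """Build a tree without any query workload (assumed strategy:
    cycle through the attributes and cut at the lower median)."""
    if len(rows) <= block_size:
        return Node(rows=rows)
    for i in range(len(attrs)):
        attr = attrs[(depth + i) % len(attrs)]
        values = sorted(r[attr] for r in rows)
        cut = values[(len(values) - 1) // 2]
        left = [r for r in rows if r[attr] <= cut]
        right = [r for r in rows if r[attr] > cut]
        if left and right:  # only accept a cut that actually separates rows
            return Node(attr, cut,
                        build(left, attrs, block_size, depth + 1),
                        build(right, attrs, block_size, depth + 1))
    return Node(rows=rows)  # no attribute separates the rows; keep one block


def scan(node: Node, pred: Predicate, stats: Dict[str, int]) -> List[Row]:
    """Answer a query, reading only blocks whose subtree can match."""
    if node.rows is not None:
        stats["blocks_read"] += 1
        return [r for r in node.rows
                if all(lo <= r[a] <= hi for a, (lo, hi) in pred.items())]
    lo, hi = pred.get(node.attr, (float("-inf"), float("inf")))
    out: List[Row] = []
    if lo <= node.cut:   # left subtree holds values <= cut
        out += scan(node.left, pred, stats)
    if hi > node.cut:    # right subtree holds values > cut
        out += scan(node.right, pred, stats)
    return out


def count_blocks(node: Node) -> int:
    """Number of leaf blocks, i.e. the cost of a full scan."""
    if node.rows is not None:
        return 1
    return count_blocks(node.left) + count_blocks(node.right)


# Toy dataset: 64 rows over two attributes, stored in blocks of 8 rows.
data = [{"a": float(i % 8), "b": float(i // 8)} for i in range(64)]
tree = build(data, ["a", "b"], block_size=8)

query = {"a": (0.0, 1.0), "b": (0.0, 3.0)}
stats = {"blocks_read": 0}
result = scan(tree, query, stats)
print(f"read {stats['blocks_read']} of {count_blocks(tree)} blocks, "
      f"matched {len(result)} rows")
```

Because the toy tree splits on both attributes, a predicate on either one (or both) prunes subtrees, which is the benefit of multi-attribute partitioning for queries that are not known in advance.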

