ML-based Cross-Platform Query Optimization

2020 IEEE 36th International Conference on Data Engineering (ICDE). Pub Date: 2020-04-01. DOI: 10.1109/ICDE48307.2020.00132

Zoi Kaoudi, Jorge-Arnulfo Quiané-Ruiz, Bertty Contreras-Rojas, Rodrigo Pardo-Meza, Anis Troudi, S. Chawla

Pages: 1489-1500

Citation count: 15

Abstract

Cost-based optimization is widely known to suffer from a major weakness: administrators spend a significant amount of time tuning the associated cost models. This problem is exacerbated in cross-platform settings, where many more parameters need to be tuned. In the era of machine learning (ML), the first step to remedy this problem is to replace the cost model of the optimizer with an ML model. However, such a solution brings two major challenges. First, the optimizer has to transform a query plan into a vector millions of times during plan enumeration, incurring a very high overhead. Second, a lot of training data is required to train the ML model effectively. We overcome these challenges in Robopt, a novel vector-based optimizer we have built for Rheem, a cross-platform system. Robopt not only uses an ML model to prune the search space but also bases the entire plan enumeration on a set of algebraic operations that operate on vectors, which are a natural fit for the ML model. This leads to both speed-up and scale-up of the enumeration process by exploiting modern CPUs via vectorization. We also accompany Robopt with a scalable training data generator for building its ML model. Our evaluation shows that (i) the vector-based approach is more efficient and scalable than simply using an ML model and (ii) Robopt matches and, in some cases, improves on Rheem’s cost-based optimizer in choosing good plans without requiring any tuning effort.
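To make the idea of enumerating plans directly on vectors more concrete, the following is a minimal sketch and not Robopt's actual implementation. The abstract does not describe the feature encoding, the algebraic operations, or the ML model, so everything here is an assumption for illustration: the feature layout in `FEATURES`, the element-wise addition in `merge`, the fixed linear function `predict_runtime` standing in for a trained model, and the top-k pruning in `enumerate_and_prune`. Pure Python tuples are used for portability; an array library would be the natural choice to benefit from CPU vectorization.

```python
from itertools import product
from typing import Callable, List, Tuple

Vector = Tuple[float, ...]

# Hypothetical feature layout (an assumption, not taken from the paper):
# one slot per (operator, platform) pair.
FEATURES = ("map@java", "map@spark", "join@java", "join@spark")


def encode(op: str, platform: str) -> Vector:
    """Encode a single operator placed on a platform as a one-hot vector."""
    key = f"{op}@{platform}"
    return tuple(1.0 if name == key else 0.0 for name in FEATURES)


def merge(a: Vector, b: Vector) -> Vector:
    """Combine two subplan vectors by element-wise addition."""
    return tuple(x + y for x, y in zip(a, b))


def enumerate_and_prune(
    left: List[Vector],
    right: List[Vector],
    predict: Callable[[Vector], float],
    keep: int,
) -> List[Vector]:
    """Merge every pair of subplan vectors, then keep the `keep` plans
    with the lowest predicted runtime."""
    candidates = [merge(a, b) for a, b in product(left, right)]
    return sorted(candidates, key=predict)[:keep]


# Stand-in for a trained ML model: a fixed linear function of the features.
WEIGHTS = (1.0, 3.0, 8.0, 2.0)


def predict_runtime(v: Vector) -> float:
    return sum(w * x for w, x in zip(WEIGHTS, v))


# Each logical operator has one candidate vector per platform.
maps = [encode("map", p) for p in ("java", "spark")]
joins = [encode("join", p) for p in ("java", "spark")]

# Plans never leave vector form: they are built, scored, and pruned as vectors.
best = enumerate_and_prune(maps, joins, predict_runtime, keep=2)
```

Because subplans are already vectors, the stand-in model can score each candidate without a separate plan-to-vector conversion step, which is the overhead the abstract identifies in the naive approach.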
