{"title":"Query Optimization Time: The New Bottleneck in Real-time Analytics","authors":"Rajkumar Sen, Jack Chen, Nika Jimsheleishvilli","doi":"10.1145/2803140.2803148","DOIUrl":null,"url":null,"abstract":"In the recent past, in-memory distributed database management systems have become increasingly popular to manage and query huge amounts of data. For an in-memory distributed database like MemSQL, it is imperative that the analytical queries run fast. A huge proportion of MemSQL's customer workloads have ad-hoc analytical queries that need to finish execution within a second or a few seconds. This leaves us with very little time to perform query optimization for complex queries involving several joins, aggregations, sub-queries etc. Even for queries that are not ad-hoc, a change in data statistics can trigger query re-optimization. Query Optimization, if not done intelligently, could very well be the bottleneck for such complex analytical queries that require real-time response. In this paper, we outline some of the early steps that we have taken to reduce the query optimization time without sacrificing plan quality. We optimized the Enumerator (the optimizer component that determines operator order), which takes up bulk of the optimization time. Generating bushy plans inside the Enumerator can be a bottleneck and so we used heuristics to generate bushy plans via query rewrite. We also implemented new distribution aware greedy heuristics to generate a good starting candidate plan that significantly prunes out states during search space analysis inside the Enumerator. We demonstrate the effectiveness of these techniques over several queries in TPC-H and TPC-DS benchmarks.","PeriodicalId":175654,"journal":{"name":"Proceedings of the 3rd VLDB Workshop on In-Memory Data Mangement and Analytics","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2015-08-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"6","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 3rd VLDB Workshop on In-Memory Data Mangement and Analytics","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2803140.2803148","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 6
Abstract
In the recent past, in-memory distributed database management systems have become increasingly popular for managing and querying huge amounts of data. For an in-memory distributed database like MemSQL, it is imperative that analytical queries run fast. A large proportion of MemSQL's customer workloads consist of ad-hoc analytical queries that need to finish execution within a second or a few seconds. This leaves very little time to perform query optimization for complex queries involving several joins, aggregations, sub-queries, etc. Even for queries that are not ad-hoc, a change in data statistics can trigger query re-optimization. Query optimization, if not done intelligently, could very well become the bottleneck for such complex analytical queries that require real-time response. In this paper, we outline some of the early steps we have taken to reduce query optimization time without sacrificing plan quality. We optimized the Enumerator (the optimizer component that determines operator order), which takes up the bulk of the optimization time. Generating bushy plans inside the Enumerator can be a bottleneck, so we used heuristics to generate bushy plans via query rewrite instead. We also implemented new distribution-aware greedy heuristics that generate a good starting candidate plan, which significantly prunes out states during search-space exploration inside the Enumerator. We demonstrate the effectiveness of these techniques on several queries from the TPC-H and TPC-DS benchmarks.
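The abstract only describes these heuristics at a high level. As a minimal illustration of the general idea of seeding an enumerator with a greedy plan, the sketch below shows a greedy, cost-based join-ordering pass whose total cost is then used as an upper bound to prune a dynamic-programming enumeration. This is not MemSQL's implementation: the table names, statistics, cost model, and pruning rule are all assumptions made for exposition, and the distribution-aware aspects of the paper's heuristics are omitted.

```python
# Illustrative sketch only: a greedy join-ordering pass that seeds a
# dynamic-programming enumerator with a cost upper bound. All names, the
# toy statistics, and the cost model are assumptions for exposition; the
# paper's Enumerator is distribution-aware and far more sophisticated.
from itertools import combinations

# Toy statistics: per-table row counts and per-join-pair selectivities.
CARD = {"A": 1_000_000, "B": 50_000, "C": 1_000, "D": 200}
SEL = {frozenset({"A", "B"}): 1e-5, frozenset({"B", "C"}): 1e-3,
       frozenset({"C", "D"}): 1e-2, frozenset({"A", "C"}): 1e-4}

def join_card(left_tables, right_tables, left_card, right_card):
    """Estimate the cardinality of joining two sub-plans: cross product
    scaled by the most selective predicate connecting the two sides."""
    sels = [SEL[frozenset({l, r})] for l in left_tables for r in right_tables
            if frozenset({l, r}) in SEL]
    return left_card * right_card * (min(sels) if sels else 1.0)

def greedy_seed(tables):
    """Greedily join the pair of sub-plans with the cheapest result first.
    Returns (plan, cost); cost = sum of intermediate join cardinalities."""
    plans = {frozenset({t}): (t, CARD[t]) for t in tables}
    total = 0.0
    while len(plans) > 1:
        l, r = min(combinations(list(plans), 2),
                   key=lambda p: join_card(p[0], p[1],
                                           plans[p[0]][1], plans[p[1]][1]))
        card = join_card(l, r, plans[l][1], plans[r][1])
        plan = (plans[l][0], plans[r][0])
        del plans[l], plans[r]
        plans[l | r] = (plan, card)
        total += card
    (plan, _), = plans.values()
    return plan, total

def dp_enumerate(tables, cost_bound):
    """Exhaustive enumeration over table subsets, skipping any partial
    plan whose cost already exceeds the greedy seed's cost."""
    best = {frozenset({t}): (t, CARD[t], 0.0) for t in tables}
    subsets = sorted((frozenset(s) for k in range(2, len(tables) + 1)
                      for s in combinations(tables, k)), key=len)
    for s in subsets:
        for k in range(1, len(s) // 2 + 1):
            for left in map(frozenset, combinations(s, k)):
                right = s - left
                if left not in best or right not in best:
                    continue
                lp, lc, lcost = best[left]
                rp, rc, rcost = best[right]
                card = join_card(left, right, lc, rc)
                cost = lcost + rcost + card
                if cost > cost_bound:   # pruned by the greedy seed's bound
                    continue
                if s not in best or cost < best[s][2]:
                    best[s] = ((lp, rp), card, cost)
    return best.get(frozenset(tables))

if __name__ == "__main__":
    tables = list(CARD)
    seed_plan, seed_cost = greedy_seed(tables)
    print("greedy seed:", seed_plan, "cost:", seed_cost)
    print("enumerated :", dp_enumerate(tables, seed_cost))
```

Because the greedy seed's cost is an achievable upper bound, any partial plan that already exceeds it can never be part of the optimum, so the enumerator can discard such states early; a better seed therefore prunes more of the search space, which is the effect the abstract attributes to the distribution-aware greedy heuristics.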