Forecasting the cost of processing multi-join queries via hashing for main-memory databases

Proceedings of the Sixth ACM Symposium on Cloud Computing Pub Date : 2015-07-11 DOI:10.1145/2806777.2806944

Feilong Liu, Spyros Blanas


Citations: 23

Abstract

Database management systems (DBMSs) carefully optimize complex multi-join queries to avoid expensive disk I/O. As servers today feature tens or hundreds of gigabytes of RAM, a significant fraction of many analytic databases becomes memory-resident. Even after careful tuning for an in-memory environment, a linear disk I/O model such as the one implemented in PostgreSQL may make query response time predictions that are up to 2× slower than the optimal multi-join query plan over memory-resident data. This paper introduces a memory I/O cost model to identify good evaluation strategies for complex query plans with multiple hash-based equi-joins over memory-resident data. The proposed cost model is carefully validated for accuracy using three different systems, including an Amazon EC2 instance, to control for hardware-specific differences. Prior work in parallel query evaluation has advocated right-deep and bushy trees for multi-join queries due to their greater parallelization and pipelining potential. A surprising finding is that the conventional wisdom from shared-nothing disk-based systems does not directly apply to the modern shared-everything memory hierarchy. As corroborated by our model, the performance gap between the optimal left-deep and right-deep query plan can grow to about 10× as the number of joins in the query increases.
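The contrast between left-deep and right-deep hash-join plans can be made concrete with a toy example. The sketch below is not the paper's cost model. It is a minimal illustration that assumes the common convention in which the left child of a hash join is the build input and the right child is the probe input. The tables `fact`, `dim_a`, and `dim_b`, the `Stats` counter, and all function names are hypothetical. The code only counts hash-table inserts and probes, which is one crude proxy for memory traffic. It does not predict response time.

```python
from collections import defaultdict


class Stats:
    """Counts hash-table operations (a crude, assumed proxy for memory traffic)."""

    def __init__(self):
        self.inserts = 0
        self.probes = 0


def build(rows, key, stats):
    """Build a hash table mapping row[key] -> list of rows."""
    table = defaultdict(list)
    for row in rows:
        table[row[key]].append(row)
        stats.inserts += 1
    return table


def probe(rows, table, key, stats):
    """Probe the table once per input row and emit the merged matches."""
    out = []
    for row in rows:
        stats.probes += 1
        for match in table.get(row[key], ()):
            out.append({**match, **row})
    return out


def right_deep(fact, dim_a, dim_b):
    """Build on both base tables first, then pipeline fact rows through both."""
    stats = Stats()
    ht_a = build(dim_a, "a", stats)
    ht_b = build(dim_b, "b", stats)
    step = probe(fact, ht_a, "a", stats)
    return probe(step, ht_b, "b", stats), stats


def left_deep(fact, dim_a, dim_b):
    """Use the result of the first join as the build input of the second."""
    stats = Stats()
    ht_a = build(dim_a, "a", stats)
    intermediate = probe(fact, ht_a, "a", stats)
    ht_intermediate = build(intermediate, "b", stats)
    return probe(dim_b, ht_intermediate, "b", stats), stats


# Hypothetical toy data: 12 fact rows, 3 + 2 dimension rows.
dim_a = [{"a": i, "a_name": f"A{i}"} for i in range(3)]
dim_b = [{"b": j, "b_name": f"B{j}"} for j in range(2)]
fact = [{"a": i % 3, "b": i % 2, "v": i} for i in range(12)]

rd_rows, rd_stats = right_deep(fact, dim_a, dim_b)
ld_rows, ld_stats = left_deep(fact, dim_a, dim_b)

print("right-deep:", rd_stats.inserts, "inserts,", rd_stats.probes, "probes")
print("left-deep: ", ld_stats.inserts, "inserts,", ld_stats.probes, "probes")
```

Both plans return the same 12 joined rows, but they shift work between building and probing: on this data the right-deep plan performs 5 inserts and 24 probes, while the left-deep plan performs 15 inserts and 14 probes. Which mix is cheaper on real hardware depends on factors this sketch ignores, such as hash-table sizes relative to the caches and the degree of parallelism.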

