Forecasting the cost of processing multi-join queries via hashing for main-memory databases

Feilong Liu, Spyros Blanas
{"title":"Forecasting the cost of processing multi-join queries via hashing for main-memory databases","authors":"Feilong Liu, Spyros Blanas","doi":"10.1145/2806777.2806944","DOIUrl":null,"url":null,"abstract":"Database management systems (DBMSs) carefully optimize complex multi-join queries to avoid expensive disk I/O. As servers today feature tens or hundreds of gigabytes of RAM, a significant fraction of many analytic databases becomes memory-resident. Even after careful tuning for an in-memory environment, a linear disk I/O model such as the one implemented in PostgreSQL may make query response time predictions that are up to 2× slower than the optimal multi-join query plan over memory-resident data. This paper introduces a memory I/O cost model to identify good evaluation strategies for complex query plans with multiple hash-based equi-joins over memory-resident data. The proposed cost model is carefully validated for accuracy using three different systems, including an Amazon EC2 instance, to control for hardware-specific differences. Prior work in parallel query evaluation has advocated right-deep and bushy trees for multi-join queries due to their greater parallelization and pipelining potential. A surprising finding is that the conventional wisdom from shared-nothing disk-based systems does not directly apply to the modern shared-everything memory hierarchy. As corroborated by our model, the performance gap between the optimal left-deep and right-deep query plan can grow to about 10× as the number of joins in the query increases.","PeriodicalId":275158,"journal":{"name":"Proceedings of the Sixth ACM Symposium on Cloud Computing","volume":"21 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2015-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"23","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the Sixth ACM Symposium on Cloud Computing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2806777.2806944","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 23

Abstract

Database management systems (DBMSs) carefully optimize complex multi-join queries to avoid expensive disk I/O. As servers today feature tens or hundreds of gigabytes of RAM, a significant fraction of many analytic databases becomes memory-resident. Even after careful tuning for an in-memory environment, a linear disk I/O model such as the one implemented in PostgreSQL may make query response time predictions that are up to 2× slower than the optimal multi-join query plan over memory-resident data. This paper introduces a memory I/O cost model to identify good evaluation strategies for complex query plans with multiple hash-based equi-joins over memory-resident data. The proposed cost model is carefully validated for accuracy using three different systems, including an Amazon EC2 instance, to control for hardware-specific differences. Prior work in parallel query evaluation has advocated right-deep and bushy trees for multi-join queries due to their greater parallelization and pipelining potential. A surprising finding is that the conventional wisdom from shared-nothing disk-based systems does not directly apply to the modern shared-everything memory hierarchy. As corroborated by our model, the performance gap between the optimal left-deep and right-deep query plan can grow to about 10× as the number of joins in the query increases.
通过对主存数据库的散列预测处理多连接查询的成本
数据库管理系统(dbms)仔细地优化复杂的多连接查询,以避免昂贵的磁盘I/O。由于今天的服务器具有数十或数百gb的RAM,因此许多分析数据库的很大一部分都是内存驻留的。即使在对内存环境进行仔细调优之后,线性磁盘I/O模型(如PostgreSQL中实现的模型)的查询响应时间预测也可能比针对内存驻留数据的最佳多连接查询计划慢2倍。本文引入了一个内存I/O成本模型,用于识别对驻留内存数据具有多个基于散列的等同连接的复杂查询计划的良好评估策略。使用三个不同的系统(包括Amazon EC2实例)仔细验证了所建议的成本模型的准确性,以控制特定于硬件的差异。先前在并行查询计算方面的工作提倡对多连接查询使用右深树和稠密树,因为它们具有更大的并行化和管道化潜力。一个令人惊讶的发现是,基于磁盘的无共享系统的传统智慧并不直接适用于现代的无共享内存层次结构。我们的模型证实,随着查询中连接数量的增加,最优左深查询计划和右深查询计划之间的性能差距可以增长到大约10倍。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信