Morsel-driven parallelism: a NUMA-aware query evaluation framework for the many-core age

Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data. Pub Date: 2014-06-18. DOI: 10.1145/2588555.2610507

Viktor Leis, P. Boncz, A. Kemper, Thomas Neumann


Citations: 255

Abstract


With modern computer architecture evolving, two problems conspire against the state-of-the-art approaches in parallel query execution: (i) to take advantage of many-cores, all query work must be distributed evenly among (soon) hundreds of threads in order to achieve good speedup, yet (ii) dividing the work evenly is difficult even with accurate data statistics due to the complexity of modern out-of-order cores. As a result, the existing approaches for plan-driven parallelism run into load balancing and context-switching bottlenecks, and therefore no longer scale. A third problem faced by many-core architectures is the decentralization of memory controllers, which leads to Non-Uniform Memory Access (NUMA). In response, we present the morsel-driven query execution framework, where scheduling becomes a fine-grained run-time task that is NUMA-aware. Morsel-driven query processing takes small fragments of input data (morsels) and schedules these to worker threads that run entire operator pipelines until the next pipeline breaker. The degree of parallelism is not baked into the plan but can elastically change during query execution, so the dispatcher can react to execution speed of different morsels but also adjust resources dynamically in response to newly arriving queries in the workload. Further, the dispatcher is aware of data locality of the NUMA-local morsels and operator state, such that the great majority of executions takes place on NUMA-local memory. Our evaluation on the TPC-H and SSB benchmarks shows extremely high absolute performance and an average speedup of over 30 with 32 cores.
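The scheduling idea in the abstract can be illustrated with a small Python sketch. It is not the authors' implementation. The names `Dispatcher`, `run_pipeline`, `execute`, and `MORSEL_SIZE` are hypothetical, the filter-and-sum pipeline is an invented example query, and NUMA nodes are only simulated as integer labels, since plain Python cannot control memory placement. The sketch shows workers pulling morsels at run time, preferring morsels tagged with their own node and taking remote ones only when their local queue is empty.

```python
import threading
from collections import deque

MORSEL_SIZE = 4  # tiny on purpose; an assumption, not a value from the abstract


class Dispatcher:
    """Hands out morsels at run time, preferring the worker's own (simulated) node."""

    def __init__(self, rows, num_nodes):
        self.lock = threading.Lock()
        self.queues = [deque() for _ in range(num_nodes)]
        # Cut the input into morsels and place them round-robin on the nodes.
        for start in range(0, len(rows), MORSEL_SIZE):
            node = (start // MORSEL_SIZE) % num_nodes
            self.queues[node].append(rows[start:start + MORSEL_SIZE])

    def next_morsel(self, node):
        """Return (morsel, was_local), or (None, False) once all work is handed out."""
        with self.lock:
            if self.queues[node]:
                return self.queues[node].popleft(), True
            for queue in self.queues:  # local queue empty: take remote work
                if queue:
                    return queue.popleft(), False
        return None, False


def run_pipeline(morsel):
    """Example pipeline: filter even values, double them, pre-aggregate the morsel."""
    return sum(x * 2 for x in morsel if x % 2 == 0)


def worker(dispatcher, node, partial_sums, local_counts):
    total, local = 0, 0
    while True:
        morsel, was_local = dispatcher.next_morsel(node)
        if morsel is None:
            break
        total += run_pipeline(morsel)
        local += int(was_local)
    # Each worker publishes its partial result once, at the pipeline breaker.
    partial_sums.append(total)
    local_counts.append(local)


def execute(rows, num_workers=4, num_nodes=2):
    """Run the example query; returns (result, number of morsels processed locally)."""
    dispatcher = Dispatcher(rows, num_nodes)
    partial_sums, local_counts = [], []
    threads = [
        threading.Thread(
            target=worker,
            args=(dispatcher, w % num_nodes, partial_sums, local_counts),
        )
        for w in range(num_workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(partial_sums), sum(local_counts)


if __name__ == "__main__":
    result, local = execute(list(range(100)))
    print(result, local)
```

Because each worker asks for the next morsel only after finishing the previous one, a slow morsel delays a single worker rather than the whole query, and the number of workers can differ from run to run without changing the plan. The count of locally processed morsels depends on thread timing, so it is reported but not fixed.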
