Reducing cache coherence traffic with hierarchical directory cache and NUMA-aware runtime scheduling

2016 International Conference on Parallel Architecture and Compilation Techniques (PACT). Pub Date: 2016-09-11. DOI: 10.1145/2967938.2967962

Paul Caheny, Marc Casas, Miquel Moretó, Hervé Gloaguen, Maxime Saintes, E. Ayguadé, Jesús Labarta, M. Valero


Citations: 9

Abstract

Cache Coherent NUMA (ccNUMA) architectures are a widespread paradigm due to the benefits they provide for scaling core count and memory capacity. The flat memory address space they offer also considerably improves programmability. However, ccNUMA architectures require sophisticated and expensive cache coherence protocols to enforce correctness during parallel executions, and these protocols trigger a significant amount of on- and off-chip traffic in the system. This paper analyses how coherence traffic may be best constrained in a large, real ccNUMA platform through the use of a joint hardware/software approach. For several benchmarks, we study coherence traffic in detail under the influence of an added hierarchical cache layer in the directory protocol, combined with runtime-managed NUMA-aware scheduling and data allocation techniques that make the most efficient use of the added hardware. The effectiveness of this joint approach is demonstrated by speedups of 1.23× to 2.54× and coherence traffic reductions between 44% and 77% in comparison to NUMA-oblivious scheduling and data allocation. Furthermore, we show that the NUMA-aware techniques we employ at the runtime level are crucial to ensuring that the added hierarchical layer in the directory coherence protocol does not introduce significant coherence traffic to the system.
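To make the contrast between NUMA-oblivious and NUMA-aware scheduling concrete, the following is a toy Python model, not the paper's runtime, hardware, or measurement method. It assumes a hypothetical 4-node machine, a round-robin block placement (`home_node`), and a made-up task set, and it counts accesses to blocks homed on another node as a rough proxy for inter-node traffic. The numbers it prints say nothing about the speedups or traffic reductions reported above.

```python
from collections import Counter

NUM_NODES = 4  # hypothetical number of NUMA nodes (assumption, not from the paper)


def home_node(block: int) -> int:
    """Assumed data placement: blocks are spread round-robin over the nodes."""
    return block % NUM_NODES


def oblivious_schedule(tasks):
    """Assign tasks to nodes in arrival order, ignoring where their data lives."""
    return {task_id: i % NUM_NODES for i, (task_id, _) in enumerate(tasks)}


def numa_aware_schedule(tasks):
    """Run each task on the node that is home to most of the blocks it touches."""
    placement = {}
    for task_id, blocks in tasks:
        votes = Counter(home_node(b) for b in blocks)
        placement[task_id] = votes.most_common(1)[0][0]
    return placement


def remote_accesses(tasks, placement):
    """Count block accesses whose home node differs from the executing node."""
    return sum(
        1
        for task_id, blocks in tasks
        for b in blocks
        if home_node(b) != placement[task_id]
    )


# Hypothetical workload: 8 tasks, each touching 3 blocks homed on node t // 2.
tasks = [(t, [t // 2 + NUM_NODES * k for k in range(3)]) for t in range(8)]

oblivious = remote_accesses(tasks, oblivious_schedule(tasks))
aware = remote_accesses(tasks, numa_aware_schedule(tasks))
print(f"remote accesses, NUMA-oblivious: {oblivious}")  # 18 of 24 accesses
print(f"remote accesses, NUMA-aware:     {aware}")      # 0 of 24 accesses
```

In this sketch the NUMA-aware policy simply follows the data; a real runtime would also have to balance load across nodes and decide where data is allocated in the first place, which this model leaves out.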
