Reducing cache coherence traffic with hierarchical directory cache and NUMA-aware runtime scheduling

Paul Caheny, Marc Casas, Miquel Moretó, Hervé Gloaguen, Maxime Saintes, E. Ayguadé, Jesús Labarta, M. Valero
{"title":"Reducing cache coherence traffic with hierarchical directory cache and NUMA-aware runtime scheduling","authors":"Paul Caheny, Marc Casas, Miquel Moretó, Hervé Gloaguen, Maxime Saintes, E. Ayguadé, Jesús Labarta, M. Valero","doi":"10.1145/2967938.2967962","DOIUrl":null,"url":null,"abstract":"Cache Coherent NUMA (ccNUMA) architectures are a widespread paradigm due to the benefits they provide for scaling core count and memory capacity. Also, the flat memory address space they offer considerably improves programmability. However, ccNUMA architectures require sophisticated and expensive cache coherence protocols to enforce correctness during parallel executions, which trigger a significant amount of on- and off-chip traffic in the system. This paper analyses how coherence traffic may be best constrained in a large, real ccNUMA platform through the use of a joint hardware/software approach. For several benchmarks, we study coherence traffic in detail under the influence of an added hierarchical cache layer in the directory protocol combined with runtime managed NUMA-aware scheduling and data allocation techniques to make most efficient use of the added hardware. The effectiveness of this joint approach is demonstrated by speedups of 1.23× to 2.54× and coherence traffic reductions between 44% and 77% in comparison to NUMA-oblivious scheduling and data allocation. Furthermore, we show that the NUMA-aware techniques we employ at the runtime level are crucial to ensure the added hierarchical layer in the directory coherence protocol does not introduce significant coherence traffic to the system.","PeriodicalId":407717,"journal":{"name":"2016 International Conference on Parallel Architecture and Compilation Techniques (PACT)","volume":"55 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2016-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"9","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2016 International Conference on Parallel Architecture and Compilation Techniques (PACT)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2967938.2967962","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 9

Abstract

Cache Coherent NUMA (ccNUMA) architectures are a widespread paradigm due to the benefits they provide for scaling core count and memory capacity. Also, the flat memory address space they offer considerably improves programmability. However, ccNUMA architectures require sophisticated and expensive cache coherence protocols to enforce correctness during parallel executions, which trigger a significant amount of on- and off-chip traffic in the system. This paper analyses how coherence traffic may be best constrained in a large, real ccNUMA platform through the use of a joint hardware/software approach. For several benchmarks, we study coherence traffic in detail under the influence of an added hierarchical cache layer in the directory protocol combined with runtime managed NUMA-aware scheduling and data allocation techniques to make most efficient use of the added hardware. The effectiveness of this joint approach is demonstrated by speedups of 1.23× to 2.54× and coherence traffic reductions between 44% and 77% in comparison to NUMA-oblivious scheduling and data allocation. Furthermore, we show that the NUMA-aware techniques we employ at the runtime level are crucial to ensure the added hierarchical layer in the directory coherence protocol does not introduce significant coherence traffic to the system.
通过分层目录缓存和numa感知的运行时调度减少缓存一致性流量
缓存一致NUMA (ccNUMA)架构是一种广泛的范例,因为它们提供了扩展核心数量和内存容量的好处。此外,它们提供的平面内存地址空间也大大提高了可编程性。然而,ccNUMA架构需要复杂且昂贵的缓存一致性协议来在并行执行期间强制执行正确性,这将在系统中触发大量的片内和片外流量。本文分析了如何通过使用联合硬件/软件方法在大型真实ccNUMA平台中最好地约束相干流量。对于几个基准测试,我们详细研究了在目录协议中添加的分层缓存层以及运行时管理的numa感知调度和数据分配技术的影响下的一致性流量,以最有效地利用添加的硬件。与numa无关的调度和数据分配相比,这种联合方法的效率提高了1.23到2.54倍,相干流量减少了44%到77%。此外,我们表明,我们在运行时级别采用的numa感知技术对于确保目录一致性协议中添加的分层层不会向系统引入显著的一致性流量至关重要。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:604180095
Book学术官方微信