Avoiding cache thrashing due to private data placement in last-level cache for manycore scaling

2009 IEEE International Conference on Computer Design Pub Date : 2009-10-04 DOI:10.1109/ICCD.2009.5413143

Jiayuan Meng, K. Skadron


Citations: 58

Abstract

Without high-bandwidth broadcast, large numbers of cores require a scalable point-to-point interconnect and a directory protocol. In such cases, a shared, inclusive last-level cache (LLC) can improve data sharing and avoid three-way communication for shared reads. However, if inclusion encompasses thread-private data, two problems arise with the shared LLC. First, current memory allocators align stack bases on page boundaries, which emerges as a source of severe conflict misses for large numbers of threads on data-parallel applications. Second, correctness does not require the private data to reside in the shared directory or the LLC. This paper advocates stack-base randomization, which eliminates the major source of conflict misses for large numbers of threads. However, when capacity becomes a limitation for the directory or last-level cache, this is not sufficient. We then propose a non-inclusive, semi-coherent cache organization (NISC) that removes the requirement for inclusion of private data and reduces capacity misses. Our data-parallel benchmarks show that these limitations prevent scaling beyond 8 cores, while our techniques allow scaling to at least 32 cores for most benchmarks. At 8 cores, stack randomization provides a mean speedup of 1.2X, while at 32 cores it gives a speedup of 2.7X over the best baseline configuration. Compared to conventional performance with a 2 MB LLC, our technique achieves similar performance with a 256 KB LLC, suggesting that LLCs may typically be overprovisioned. When very limited LLC resources are available, NISC can further improve system performance by 1.8X.
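To make the conflict-miss argument concrete, the following is a minimal sketch (not the authors' simulator or allocator) of how aligned stack bases can collide in a set-associative cache and how a random per-thread offset spreads them out. The line size, associativity, stack spacing, starting address, and all function names are illustrative assumptions; only the 2 MB LLC size and the 32-thread count come from the abstract.

```python
import random

# All parameters below are assumptions for illustration, except the
# 2 MB LLC size and 32 threads, which are mentioned in the abstract.
LINE_SIZE = 64                    # bytes per cache line (assumed)
LLC_SIZE = 2 * 1024 ** 2          # 2 MB last-level cache
WAYS = 16                         # associativity (assumed)
NUM_SETS = LLC_SIZE // (LINE_SIZE * WAYS)
STACK_STRIDE = 8 * 1024 ** 2      # spacing between thread stacks (assumed)
FIRST_BASE = 0x7F0000000000       # hypothetical page-aligned address


def set_index(addr):
    """Cache set that a byte address maps to under simple modulo indexing."""
    return (addr // LINE_SIZE) % NUM_SETS


def aligned_stack_bases(num_threads):
    """Stack bases placed at a fixed, page-aligned stride from each other."""
    return [FIRST_BASE - t * STACK_STRIDE for t in range(num_threads)]


def randomized_stack_bases(num_threads, rng):
    """Same bases, each shifted down by a random line-aligned offset."""
    span = NUM_SETS * LINE_SIZE   # one full pass over all cache sets
    return [base - rng.randrange(0, span, LINE_SIZE)
            for base in aligned_stack_bases(num_threads)]


def max_set_pressure(bases):
    """Largest number of stack bases that share a single cache set.

    The base address stands in for the hot stack frames near it.
    """
    counts = {}
    for base in bases:
        idx = set_index(base)
        counts[idx] = counts.get(idx, 0) + 1
    return max(counts.values())


if __name__ == "__main__":
    threads = 32
    rng = random.Random(0)
    print("aligned:   ", max_set_pressure(aligned_stack_bases(threads)))
    print("randomized:", max_set_pressure(randomized_stack_bases(threads, rng)))
```

Under these assumed parameters, every aligned base maps to the same set, so 32 threads compete for 16 ways, whereas the randomized bases are scattered over many sets. The sketch models only set indexing; it does not model replacement, the directory, or capacity effects, which is where the abstract says NISC becomes relevant.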


