Improving support for locality and fine-grain sharing in chip multiprocessors

Hemayet Hossain, S. Dwarkadas, Michael C. Huang
{"title":"Improving support for locality and fine-grain sharing in chip multiprocessors","authors":"Hemayet Hossain, S. Dwarkadas, Michael C. Huang","doi":"10.1145/1454115.1454138","DOIUrl":null,"url":null,"abstract":"Both commercial and scientific workloads benefit from concurrency and exhibit data sharing across threads/processes. The resulting sharing patterns are often fine-grain, with the modified cache lines still residing in the writer's primary cache when accessed. Chip multiprocessors present an opportunity to optimize for fine-grain sharing using direct access to remote processor components through low-latency on-chip interconnects. In this paper, we present Adaptive Replication, Migration, and producer-Consumer Optimization (ARMCO), a coherence protocol that, to the best of our knowledge, is the first to exploit direct access to the L1 caches of remote processors (rather than via coherence mechanisms) in order to support fine-grain sharing. Our goal is to provide support for tightly coupled sharing by recognizing and adapting to common sharing patterns such as migratory, producer-consumer, multiple-reader, and multiple read-write. The protocol places data close to where it is most needed and leverages direct access when following conventional coherence actions proves wasteful. Via targeted optimizations for each of these access patterns, our proposed protocol is able to reduce the average access latency and increase the effective cache capacity at the L1 level with on-chip storage overhead as low as 0.38%. Full-system simulations of 16-processor CMPs show an average (geometric mean) speedup of 1.13 (ranging from 1.04 to 2.26) for 12 commercial, scientific, and mining workloads, with an average of 1.18 if we include 2 microbenchmarks. ARMCO also reduces the on-chip bandwidth requirements and dynamic energy (power) consumption by an average of 33.3% and 31.2% (20.2%) respectively. By evaluating optimizations at both the L1 and the L2 level, we demonstrate that when considering performance, optimization at the L1 level is more effective at supporting fine-grain sharing than that at the L2 level.","PeriodicalId":186773,"journal":{"name":"2008 International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"48 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2008-10-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"31","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2008 International Conference on Parallel Architectures and Compilation Techniques (PACT)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/1454115.1454138","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 31

Abstract

Both commercial and scientific workloads benefit from concurrency and exhibit data sharing across threads/processes. The resulting sharing patterns are often fine-grain, with the modified cache lines still residing in the writer's primary cache when accessed. Chip multiprocessors present an opportunity to optimize for fine-grain sharing using direct access to remote processor components through low-latency on-chip interconnects. In this paper, we present Adaptive Replication, Migration, and producer-Consumer Optimization (ARMCO), a coherence protocol that, to the best of our knowledge, is the first to exploit direct access to the L1 caches of remote processors (rather than via coherence mechanisms) in order to support fine-grain sharing. Our goal is to provide support for tightly coupled sharing by recognizing and adapting to common sharing patterns such as migratory, producer-consumer, multiple-reader, and multiple read-write. The protocol places data close to where it is most needed and leverages direct access when following conventional coherence actions proves wasteful. Via targeted optimizations for each of these access patterns, our proposed protocol is able to reduce the average access latency and increase the effective cache capacity at the L1 level with on-chip storage overhead as low as 0.38%. Full-system simulations of 16-processor CMPs show an average (geometric mean) speedup of 1.13 (ranging from 1.04 to 2.26) for 12 commercial, scientific, and mining workloads, with an average of 1.18 if we include 2 microbenchmarks. ARMCO also reduces the on-chip bandwidth requirements and dynamic energy (power) consumption by an average of 33.3% and 31.2% (20.2%) respectively. By evaluating optimizations at both the L1 and the L2 level, we demonstrate that when considering performance, optimization at the L1 level is more effective at supporting fine-grain sharing than that at the L2 level.
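To make the sharing-pattern adaptation concrete, the sketch below shows one way a per-cache-line classifier of the kind ARMCO relies on could be organized. This is an illustrative assumption for exposition only: the state names, thresholds, and update rules are not taken from the paper, which detects and reacts to these patterns through its own coherence-protocol mechanisms at the L1 level.

/* Illustrative sketch only: a simplified per-cache-line sharing-pattern
 * classifier. The states, counters, and thresholds here are assumptions
 * chosen for clarity, not the paper's actual mechanism. */
#include <stdio.h>

typedef enum { PAT_UNKNOWN, PAT_MIGRATORY, PAT_PRODUCER_CONSUMER,
               PAT_MULTI_READER, PAT_MULTI_READ_WRITE } pattern_t;

typedef struct {
    pattern_t pattern;      /* current classification for this line */
    int last_writer;        /* core id of the most recent writer, -1 if none */
    int distinct_writers;   /* distinct writing cores seen so far */
    int reads_since_write;  /* reads observed since the last write */
} line_state_t;

/* Update the classification on each access observed for the line. */
static void observe_access(line_state_t *s, int core, int is_write)
{
    if (is_write) {
        if (s->last_writer != -1 && s->last_writer != core)
            s->distinct_writers++;
        /* A write by a new core with few intervening reads suggests
         * migratory data: move ownership rather than replicate. */
        s->pattern = (s->reads_since_write <= 1 && s->distinct_writers > 1)
                         ? PAT_MIGRATORY
                         : (s->distinct_writers > 1 ? PAT_MULTI_READ_WRITE
                                                    : PAT_UNKNOWN);
        s->last_writer = core;
        s->reads_since_write = 0;
    } else {
        s->reads_since_write++;
        if (s->last_writer != -1 && s->last_writer != core)
            /* Reads of another core's fresh writes look like
             * producer-consumer; sustained reading by multiple cores
             * looks like multiple-reader data worth replicating. */
            s->pattern = (s->reads_since_write >= 2) ? PAT_MULTI_READER
                                                     : PAT_PRODUCER_CONSUMER;
    }
}

int main(void)
{
    line_state_t s = { PAT_UNKNOWN, -1, 0, 0 };
    observe_access(&s, 0, 1);  /* core 0 writes (produces) */
    observe_access(&s, 1, 0);  /* core 1 reads the fresh data */
    printf("pattern = %d\n", s.pattern);
    return 0;
}

In this simplified view, a read-then-write handoff between cores maps to migratory data (ownership moves with the accessor), while repeated reads of another core's recent writes map to producer-consumer or multiple-reader data that is cheaper to deliver or replicate directly from the writer's L1 than to service through conventional coherence actions.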