RC3: Consistency Directed Cache Coherence for x86-64 with RC Extensions

2015 International Conference on Parallel Architecture and Compilation (PACT). Pub Date: 2015-10-18. DOI: 10.1109/PACT.2015.37

M. Elver, V. Nagarajan

Citations: 9

Abstract

The recent convergence towards programming-language-based memory consistency models has sparked renewed interest in lazy cache coherence protocols. These protocols exploit synchronization information by enforcing coherence only at synchronization boundaries via self-invalidation. As a result, such protocols do not require sharer tracking, which benefits scalability. On the downside, they are readily applicable only to a restricted set of consistency models, such as Release Consistency (RC), which expose synchronization information explicitly. In particular, existing architectures with stricter consistency models (such as x86-64) cannot readily use lazy coherence protocols without either changing the architecture's consistency model to (a variant of) RC at the expense of backwards compatibility, or adapting the protocol to satisfy the stricter consistency model, thereby failing to benefit from synchronization information. We show an approach for the x86-64 architecture that is a compromise between the two. First, we propose a mechanism to convey synchronization information via a simple ISA extension, while retaining backwards compatibility with legacy code and older microarchitectures. Second, we propose RC3, a scalable hardware cache coherence protocol for RCtso, the resulting memory consistency model. RC3 does not track sharers and relies on self-invalidation on acquires. To satisfy RCtso efficiently, the protocol reduces self-invalidations transitively using per-L1 timestamps only. RC3 outperforms a conventional lazy RC protocol by 12%, achieving performance comparable to a MESI directory protocol for RC-optimized programs. RC3's storage overhead per cache line scales logarithmically with increasing core count, and RC3 reduces on-chip coherence storage overheads by 45% compared to a related approach specifically targeting TSO.
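The general idea of lazy coherence via self-invalidation can be illustrated with a toy software model. This is only a sketch of the generic scheme the abstract describes (no sharer tracking, writes made visible at a release, stale copies dropped at an acquire); it is not the RC3 protocol itself and it leaves out the per-L1 timestamp optimization, whose details the abstract does not give. All names (`SharedLevel`, `L1Cache`, `demo`) are hypothetical.

```python
class SharedLevel:
    """Stands in for the shared cache level. It keeps no list of sharers."""

    def __init__(self):
        self.data = {}


class L1Cache:
    """Toy private cache using lazy coherence (illustrative assumption)."""

    def __init__(self, shared):
        self.shared = shared
        self.lines = {}     # addr -> locally cached value (may be stale)
        self.dirty = set()  # addrs written locally since the last release

    def load(self, addr):
        # On a miss, fetch from the shared level; hits may return stale data.
        if addr not in self.lines:
            self.lines[addr] = self.shared.data.get(addr, 0)
        return self.lines[addr]

    def store(self, addr, value):
        # Writes stay local until the next release.
        self.lines[addr] = value
        self.dirty.add(addr)

    def release(self):
        # Make all prior local writes visible at the shared level.
        for addr in self.dirty:
            self.shared.data[addr] = self.lines[addr]
        self.dirty.clear()

    def acquire(self):
        # Self-invalidate: drop every clean copy so later loads refetch.
        # No other cache had to send an invalidation message.
        self.lines = {a: v for a, v in self.lines.items() if a in self.dirty}


def demo():
    shared = SharedLevel()
    producer, consumer = L1Cache(shared), L1Cache(shared)

    consumer.load("data")           # consumer caches the initial value 0
    producer.store("data", 42)
    producer.release()              # write becomes visible at the shared level

    stale = consumer.load("data")   # still the old copy: nothing invalidated it
    consumer.acquire()              # synchronization boundary
    fresh = consumer.load("data")   # refetched after self-invalidation
    return stale, fresh


if __name__ == "__main__":
    print(demo())
```

In this model the consumer only observes the producer's write after its own acquire, which is acceptable under RC because only synchronization operations order the accesses. A model such as TSO expects stronger ordering for ordinary loads and stores, which is why the abstract argues that x86-64 needs either explicit synchronization information or a modified protocol.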
