Extending CC-NUMA systems to support write update optimizations

2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis. Pub Date: 2008-11-15. DOI: 10.1145/1413370.1413401

Liqun Cheng, J. Carter


Citations: 8

Abstract

Processor stalls and protocol messages caused by coherence misses limit the performance of shared memory applications. Modern multiprocessors employ write-invalidate coherence protocols, which induce read misses to ensure consistency. Previous research has shown that an invalidate protocol is not optimal for all memory access patterns: an update protocol can significantly outperform an invalidate protocol when data is heavily shared or accessed in predictable patterns. However, update protocols can generate excessive network traffic and are difficult to build on a scalable (non-bus) interconnect. To obtain the benefits of both invalidate and update protocols, we built a speculative, sequentially consistent write-update mechanism on top of a write-invalidate protocol. To ensure coherence, a processor wishing to write to a block of data uses a traditional write-invalidate protocol to obtain exclusive access to the block before modifying it. To improve performance, the writing processor can later self-downgrade the modified block to the shared state and flush it back to its home node, which forwards the new data to the processors that it predicts are likely to consume it. We present a practical and cost-effective design for extending CC-NUMA systems to support this speculative update mechanism; the design requires no changes to the processor core, bus interface, or memory consistency model. We also present two hardware-efficient mechanisms for detecting access patterns that benefit from the speculative update mechanism: stable reader set and stream. We evaluate our update mechanisms on a wide range of scientific benchmarks and commercial applications. Using a cycle-accurate execution-driven simulator of a future 16-node SGI multiprocessor, we find that the mechanisms proposed in this paper reduce the average remote miss rate by 30%, reduce network traffic by 15%, and improve performance by 10%, and in no case hurt performance.
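The write path described in the abstract (obtain exclusive access through write-invalidate, then self-downgrade, flush to the home node, and have the home node forward the data to predicted consumers) can be illustrated with a toy model. The sketch below is not the paper's design: the class `ToyDirectory`, the function `run`, and the prediction rule (push updates only when the reader set was identical in the two most recent write epochs) are all assumptions made for illustration, loosely inspired by the "stable reader set" idea named above. It tracks a single block and counts consumer read misses and pushed updates.

```python
class ToyDirectory:
    """Home-node state for ONE block. Hypothetical toy model, not the paper's hardware."""

    def __init__(self, speculative_update):
        self.speculative_update = speculative_update
        self.value = 0
        self.valid_copies = set()   # nodes currently holding a valid copy
        self.prev_readers = set()   # readers seen in the previous write epoch
        self.cur_readers = set()    # readers seen since the latest write
        self.read_misses = 0
        self.updates_pushed = 0

    def write(self, writer, value):
        # Step 1: ordinary write-invalidate. Every other copy is invalidated
        # and only the writer holds the block.
        self.valid_copies = {writer}
        self.value = value

        # Assumed prediction rule: the reader set is "stable" if the last two
        # epochs saw exactly the same non-empty set of readers.
        stable = bool(self.cur_readers) and self.cur_readers == self.prev_readers
        predicted = self.cur_readers - {writer} if stable else set()
        self.prev_readers, self.cur_readers = self.cur_readers, set()

        # Step 2: the writer self-downgrades and flushes to the home node,
        # which forwards the new data to the predicted consumers.
        if self.speculative_update and predicted:
            self.valid_copies |= predicted
            self.updates_pushed += len(predicted)

    def read(self, reader):
        self.cur_readers.add(reader)
        if reader not in self.valid_copies:
            self.read_misses += 1          # coherence miss: fetch from home
            self.valid_copies.add(reader)
        return self.value


def run(speculative_update, rounds=10, consumers=(1, 2, 3)):
    """Producer node 0 writes each round; every consumer then reads the block."""
    directory = ToyDirectory(speculative_update)
    for i in range(rounds):
        directory.write(0, i)
        for c in consumers:
            assert directory.read(c) == i  # readers always see the latest value
    return directory.read_misses, directory.updates_pushed
```

In this toy producer-consumer run, `run(False)` returns `(30, 0)` (every consumer read misses) and `run(True)` returns `(6, 24)` (misses occur only during the two epochs needed to learn the reader set, after which 24 updates are pushed). These figures describe the toy model only and say nothing about the simulation results reported in the abstract; they merely show the trade of read misses for pushed update messages.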

