Extending CC-NUMA systems to support write update optimizations

Liqun Cheng, J. Carter
{"title":"Extending CC-NUMA systems to support write update optimizations","authors":"Liqun Cheng, J. Carter","doi":"10.1145/1413370.1413401","DOIUrl":null,"url":null,"abstract":"Processor stalls and protocol messages caused by coherence misses limit the performance of shared memory applications. Modern multiprocessors employ write-invalidate coherence protocols, which induce read misses to ensure consistency. Previous research has shown that an invalidate protocol is not optimal for all memory access patterns - an update protocol can significantly outperform an invalidate protocol when data is heavily shared or accessed in predictable patterns. However, update protocols can generate excessive network traffic and are difficult to build on a scalable (non-bus) interconnect. To obtain the benefits of both invalidate and update protocols, we built a speculative sequentially consistent write- update mechanism on top of a write-invalidate protocol. To ensure coherence, a processor wishing to write to a block of data uses a traditional write-invalidate protocol to obtain exclusive access to the block before modifying it. To improve performance, the writing processor can later self- downgrade the modified block to the shared state and flush it back to its home node, which forwards the new data to processors that it predicts are likely to consume the data. We present a practical and cost-effective design for extending CC-NUMA systems to support this speculative update mechanism that requires no changes to the processor core, bus interface, or memory consistency model. We also present two hardware-efficient mechanisms for detecting access patterns that benefit from the speculative update mechanism, stable reader set and stream. We evaluate our update mechanisms on a wide range of scientific benchmarks and commercial applications. Using a cycle-accurate execution-driven simulator of a future 16-node SGI multiprocessor, we find that the mechanisms proposed in this paper reduce the average remote miss rate by 30%, reduce network traffic by 15%, and improve performance by 10%, and in no case hurt performance.","PeriodicalId":230761,"journal":{"name":"2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"31 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2008-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"8","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/1413370.1413401","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 8

Abstract

Processor stalls and protocol messages caused by coherence misses limit the performance of shared memory applications. Modern multiprocessors employ write-invalidate coherence protocols, which induce read misses to ensure consistency. Previous research has shown that an invalidate protocol is not optimal for all memory access patterns - an update protocol can significantly outperform an invalidate protocol when data is heavily shared or accessed in predictable patterns. However, update protocols can generate excessive network traffic and are difficult to build on a scalable (non-bus) interconnect. To obtain the benefits of both invalidate and update protocols, we built a speculative sequentially consistent write- update mechanism on top of a write-invalidate protocol. To ensure coherence, a processor wishing to write to a block of data uses a traditional write-invalidate protocol to obtain exclusive access to the block before modifying it. To improve performance, the writing processor can later self- downgrade the modified block to the shared state and flush it back to its home node, which forwards the new data to processors that it predicts are likely to consume the data. We present a practical and cost-effective design for extending CC-NUMA systems to support this speculative update mechanism that requires no changes to the processor core, bus interface, or memory consistency model. We also present two hardware-efficient mechanisms for detecting access patterns that benefit from the speculative update mechanism, stable reader set and stream. We evaluate our update mechanisms on a wide range of scientific benchmarks and commercial applications. Using a cycle-accurate execution-driven simulator of a future 16-node SGI multiprocessor, we find that the mechanisms proposed in this paper reduce the average remote miss rate by 30%, reduce network traffic by 15%, and improve performance by 10%, and in no case hurt performance.
扩展CC-NUMA系统以支持写更新优化
由一致性缺失引起的处理器停滞和协议消息限制了共享内存应用程序的性能。现代多处理器采用写无效一致性协议,该协议诱导读失败以确保一致性。以前的研究表明,invalidate协议并不是所有内存访问模式的最佳选择——当数据被大量共享或以可预测的模式访问时,更新协议可以显著优于invalidate协议。然而,更新协议可能会产生过多的网络流量,并且难以在可扩展(非总线)互连上构建。为了获得invalidate协议和update协议的优点,我们在write-invalidate协议之上构建了一个推测性的顺序一致的写更新机制。为了确保一致性,希望写入数据块的处理器使用传统的write-invalidate协议在修改数据块之前获得对该块的独占访问权。为了提高性能,写处理器可以稍后将修改后的块自降级为共享状态,并将其刷新回主节点,主节点将新数据转发给它预测可能会使用这些数据的处理器。我们提出了一种实用且经济有效的设计,用于扩展CC-NUMA系统以支持这种推测更新机制,该机制不需要更改处理器核心、总线接口或内存一致性模型。我们还提出了两种硬件高效的机制来检测访问模式,这两种机制受益于推测更新机制、稳定的读取器集和流。我们在广泛的科学基准和商业应用上评估我们的更新机制。在未来16节点SGI多处理器的周期精确执行驱动模拟器上,我们发现本文提出的机制将平均远程失分率降低了30%,网络流量减少了15%,性能提高了10%,而且在任何情况下都不会影响性能。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信