Improving coherence protocol reactiveness by trading bandwidth for latency

ACM International Conference on Computing Frontiers. Pub Date: 2012-05-15. DOI: 10.1145/2212908.2212929

L. G. Menezo, Valentin Puente, Pablo Abad Fidalgo, J. Gregorio


Citations: 4


Improving coherence protocol reactiveness by trading bandwidth for latency

This paper describes how the particular characteristics of on-chip networks can be used to improve coherence protocol responsiveness. To this end, a new coherence protocol named LOCKE is proposed. LOCKE exploits the large bandwidth available on chip to improve the performance and energy efficiency of cache-coherent chip multiprocessors (CMPs). Provided that the interconnection network is designed to support multicast traffic and the protocol maximizes the potential advantages that direct coherence brings, we demonstrate that a multicast-based coherence protocol can reduce energy requirements in the CMP memory hierarchy. The key idea is to establish a suitable level of on-chip network throughput so as to accelerate synchronization by two means: avoiding the protocol serialization inherent to directory-based coherence protocols, and reducing average access time more than other snoop-based coherence protocols do when shared data is truly contended. LOCKE is built on top of a Token Coherence performance substrate, with a new set of simple proactive policies that speed up data synchronization and eliminate the passive token starvation avoidance mechanism. Evaluated on a full-system simulator that faithfully models the on-chip interconnect, an aggressive core architecture, and precise memory hierarchy details while running a broad spectrum of workloads, our proposal improves on both directory-based and token-based coherence protocols in terms of both energy and performance, at least in systems with up to 16 aggressive out-of-order processors on the chip.
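The abstract names a Token Coherence substrate without spelling out its rules. As a rough illustration only, the sketch below models the basic token-counting invariant that Token Coherence is generally understood to rely on: each block has a fixed number of tokens, a node may read with at least one token, and a node may write only when it holds all of them. The class `TokenBlock`, its method names, and the `multicast_write_request` helper are hypothetical and are not taken from the paper; the owner token, data transfer, network timing, and LOCKE's proactive policies are deliberately left out.

```python
from dataclasses import dataclass, field


@dataclass
class TokenBlock:
    """Token bookkeeping for one memory block (illustrative; not LOCKE itself)."""

    total_tokens: int
    holders: dict = field(default_factory=dict)  # node name -> tokens held

    def __post_init__(self):
        # Assumption: memory holds every token before any cache requests the block.
        if not self.holders:
            self.holders = {"memory": self.total_tokens}

    def tokens(self, node):
        return self.holders.get(node, 0)

    def can_read(self, node):
        # Reading requires at least one token.
        return self.tokens(node) >= 1

    def can_write(self, node):
        # Writing requires every token, so no other node can still read.
        return self.tokens(node) == self.total_tokens

    def transfer(self, src, dst, count):
        if count < 1 or count > self.tokens(src):
            raise ValueError("source does not hold that many tokens")
        self.holders[src] = self.tokens(src) - count
        self.holders[dst] = self.tokens(dst) + count
        # Tokens are never created or destroyed.
        assert sum(self.holders.values()) == self.total_tokens

    def multicast_write_request(self, requester):
        # Hypothetical helper: the requester asks every holder directly
        # (no directory lookup) and each holder hands over all its tokens.
        for node in list(self.holders):
            if node != requester and self.tokens(node) > 0:
                self.transfer(node, requester, self.tokens(node))


block = TokenBlock(total_tokens=4)
block.transfer("memory", "core0", 1)   # core0 becomes a reader
block.transfer("memory", "core1", 1)   # core1 becomes a reader
block.multicast_write_request("core2")  # core2 collects every token
print(block.can_write("core2"), block.can_read("core0"))
```

In this toy model the write request reaches the token holders directly, which is the kind of directory indirection avoidance the abstract alludes to; how LOCKE actually schedules requests and prevents starvation is not described in the abstract and is not modeled here.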
