Flexible use of memory for replication/migration in cache-coherent DSM multiprocessors

V. Soundararajan, M. Heinrich, Ben Verghese, K. Gharachorloo, Anoop Gupta, J. Hennessy
{"title":"Flexible use of memory for replication/migration in cache-coherent DSM multiprocessors","authors":"V. Soundararajan, M. Heinrich, Ben Verghese, K. Gharachorloo, Anoop Gupta, J. Hennessy","doi":"10.1145/279358.279403","DOIUrl":null,"url":null,"abstract":"Given the limitations of bus-based multiprocessors, CC-NUMA is the scalable architecture of choice for shared-memory machines. The most important characteristic of the CC-NUMA architecture is that the latency to access data on a remote node is considerably larger than the latency to access local memory. On such machines, good data locality can reduce memory stall time and is therefore a critical factor in application performance. In this paper we study the various options available to system designers to transparently decrease the fraction of data misses serviced remotely. This work is done in the context of the Stanford FLASH multiprocessor. FLASH is unique in that each node has a single pool of DRAM that can be used in a variety of ways by the programmable memory controller. We use the programmability of FLASH to explore different options for cache-coherence and data-locality in compute-server workloads. First, we consider two protocols for providing base cache-coherence, one with centralized directory information (dynamic pointer allocation) and another with distributed directory information (SCI). While several commercial systems are based on SCI, we find that a centralized scheme has superior performance. Next, we consider different hardware and software techniques that use some or all of the local memory in a node to improve data locality. Finally, we propose a hybrid scheme that combines hardware and software techniques. These schemes work on the same base platform with both user and kernel references from the workloads. 
The paper thus offers a realistic and fair comparison of replication/migration techniques that has not previously been feasible.","PeriodicalId":393075,"journal":{"name":"Proceedings. 25th Annual International Symposium on Computer Architecture (Cat. No.98CB36235)","volume":"176 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"62","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings. 25th Annual International Symposium on Computer Architecture (Cat. No.98CB36235)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/279358.279403","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 62

Abstract

Given the limitations of bus-based multiprocessors, CC-NUMA is the scalable architecture of choice for shared-memory machines. The most important characteristic of the CC-NUMA architecture is that the latency to access data on a remote node is considerably larger than the latency to access local memory. On such machines, good data locality can reduce memory stall time and is therefore a critical factor in application performance. In this paper we study the various options available to system designers to transparently decrease the fraction of data misses serviced remotely. This work is done in the context of the Stanford FLASH multiprocessor. FLASH is unique in that each node has a single pool of DRAM that can be used in a variety of ways by the programmable memory controller. We use the programmability of FLASH to explore different options for cache-coherence and data-locality in compute-server workloads. First, we consider two protocols for providing base cache-coherence, one with centralized directory information (dynamic pointer allocation) and another with distributed directory information (SCI). While several commercial systems are based on SCI, we find that a centralized scheme has superior performance. Next, we consider different hardware and software techniques that use some or all of the local memory in a node to improve data locality. Finally, we propose a hybrid scheme that combines hardware and software techniques. These schemes work on the same base platform with both user and kernel references from the workloads. The paper thus offers a realistic and fair comparison of replication/migration techniques that has not previously been feasible.
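The replication/migration idea in the abstract can be made concrete with a toy policy: count remote misses per page and per node, then migrate a page when one node dominates its remote traffic, or replicate it when it is read-shared by many nodes. This is a minimal illustrative sketch only; the node count, threshold, and decision rule are hypothetical and are not the policies evaluated in the paper or the FLASH protocol code.

```c
#include <assert.h>

enum action { NOP, MIGRATE, REPLICATE };

#define NNODES 4   /* hypothetical machine size */
#define THRESH 8   /* hypothetical remote-miss trigger threshold */

/* Decide what to do with a page, given per-node remote-miss counters
 * and whether the page is writable. Illustrative policy only. */
enum action classify_page(const int miss[NNODES], int writable)
{
    int total = 0, max = 0;
    for (int n = 0; n < NNODES; n++) {
        total += miss[n];
        if (miss[n] > max)
            max = miss[n];
    }
    if (total < THRESH)
        return NOP;        /* too little remote traffic to bother */
    if (max * 2 > total)
        return MIGRATE;    /* one node dominates: move the page there */
    if (!writable)
        return REPLICATE;  /* read-shared: give each node a local copy */
    return NOP;            /* write-shared: coherence cost of copies too high */
}
```

In this sketch, write-shared pages are left in place because replicating them would force coherence traffic on every write; the paper's hybrid scheme explores more refined hardware/software divisions of exactly this kind of decision.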