Reducing the latency of L2 misses in shared-memory multiprocessors through on-chip directory integration

Proceedings 10th Euromicro Workshop on Parallel, Distributed and Network-based Processing
Pub Date: 2002-01-09
DOI: 10.1109/EMPDP.2002.994312

M. Acacio, José González, José M. García, J. Duato

Citations: 3

Abstract

Recent technology improvements allow multiprocessor designers to integrate key components, such as the memory controller and the network interface, into the processor chip. In this paper, we exploit this level of integration and present a new three-level directory architecture that aims to reduce the long L2 miss latencies and the memory overhead that characterize cc-NUMA machines and limit their scalability. To reduce miss latencies, the proposed architecture integrates into the processor chip the directory controller and a small first-level directory cache that stores precise information for the most recently referenced memory lines. The second- and third-level directories are located near main memory and are accessed only when the first-level directory holds no entry for the requested memory line. This off-chip structure achieves the performance of a large, non-scalable full-map directory with a significantly lower memory overhead. Using execution-driven simulations, we show that the proposed directory architecture substantially reduces latency: load, store and read-modify-write misses are significantly accelerated (latency reductions of more than 35% in some cases). These reductions translate into important improvements in final application performance (execution time reduced by up to 20%).
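The lookup order described in the abstract (on-chip first-level directory first, off-chip second and third levels only on a first-level miss) can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the authors' design: the class name `ThreeLevelDirectory`, the LRU replacement, the capacities, the choice to move first-level victims into the second level, and the cycle costs are all hypothetical, since the abstract gives none of these details.

```python
from collections import OrderedDict

# Hypothetical access costs in cycles; the abstract gives no concrete figures.
ON_CHIP_COST = 1
OFF_CHIP_COST = 20


class ThreeLevelDirectory:
    """Toy model: try the on-chip directory cache first, go off chip on a miss."""

    def __init__(self, l1_capacity=4, l2_capacity=16):
        self.l1 = OrderedDict()  # small on-chip first-level cache (assumed LRU)
        self.l2 = OrderedDict()  # off-chip second level (assumed LRU)
        self.l3 = {}             # off-chip third level, knows every line seen
        self.l1_capacity = l1_capacity
        self.l2_capacity = l2_capacity

    def _promote(self, line, sharers):
        # Install the entry on chip; an evicted victim falls back to level 2.
        self.l1[line] = sharers
        if len(self.l1) > self.l1_capacity:
            victim, victim_sharers = self.l1.popitem(last=False)
            self.l2[victim] = victim_sharers
            if len(self.l2) > self.l2_capacity:
                # Dropped from level 2, but level 3 still holds the entry.
                self.l2.popitem(last=False)

    def lookup(self, line):
        """Return (sharers, level_that_answered, cost_in_cycles)."""
        if line in self.l1:
            self.l1.move_to_end(line)  # refresh LRU position
            return self.l1[line], 1, ON_CHIP_COST
        # First-level miss: only now are the off-chip levels consulted.
        if line in self.l2:
            sharers, level = self.l2.pop(line), 2
        else:
            sharers, level = self.l3.setdefault(line, set()), 3
        self._promote(line, sharers)
        return sharers, level, ON_CHIP_COST + OFF_CHIP_COST


if __name__ == "__main__":
    directory = ThreeLevelDirectory(l1_capacity=2, l2_capacity=2)
    for line in (0xA, 0xA, 0xB, 0xC, 0xA):
        sharers, level, cost = directory.lookup(line)
        print(f"line {line:#x}: answered by level {level}, cost {cost} cycles")
```

In this model a repeated reference to a recently used line is answered on chip at the lower cost, which is the effect the abstract attributes to the first-level directory cache. The sketch says nothing about how the real second- and third-level directories encode sharing information.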
