Simplifying Active Memory Clusters by Leveraging Directory Protocol Threads

2007 IEEE International Symposium on Performance Analysis of Systems & Software. Pub Date: 2007-04-25. DOI: 10.1109/ISPASS.2007.363754

Dhiraj D. Kalamkar, Mainak Chaudhuri, M. Heinrich



Abstract

Address re-mapping techniques in so-called active memory systems have been shown to dramatically increase the performance of applications with poor cache and/or communication behavior on shared memory multiprocessors. However, these systems require custom hardware in the memory controller for cache line assembly/disassembly, address translation between re-mapped and normal addresses, and coherence logic. In this paper we observe that on a traditional flexible distributed shared memory (DSM) multiprocessor node, equipped with either a coherence protocol thread context as in SMTp or a simple dedicated in-order protocol processing core as in a CMP, the address re-mapping techniques can be implemented in software running on the protocol thread or core, without custom hardware in the memory controller, while still delivering high performance. We implement the active memory address re-mapping techniques of parallel reduction and matrix transpose (two popular kernels in scientific, multimedia, and data mining applications) on these systems, outline the novel coherence protocol extensions needed to make them run efficiently in software protocols, and evaluate these protocols on four different DSM multiprocessor architectures with multi-threaded and/or dual-core nodes. The proposed protocol extensions yield speedups of 1.45 for parallel reduction and 1.29 for matrix transpose on a 16-node DSM multiprocessor when compared to non-active memory baseline systems, and achieve performance comparable to existing active memory architectures that rely on custom hardware in the memory controller.
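
To make the terms "address translation between re-mapped and normal addresses" and "cache line assembly/disassembly" concrete, the following is a minimal sketch of what a software handler for the two kernels might compute. It is not the protocol from the paper: the function names, the 128-byte line size, the 8-byte element size, the row-major layout, and the merge-on-writeback model of parallel reduction are all assumptions made for illustration, and coherence actions are omitted entirely.

```python
# Illustrative sketch only. All names and sizes below are hypothetical
# and are not taken from the paper.

ELEM_SIZE = 8                            # bytes per element (assumed: doubles)
LINE_SIZE = 128                          # bytes per cache line (assumed)
ELEMS_PER_LINE = LINE_SIZE // ELEM_SIZE  # 16 elements per line


def remap_transpose(shadow_index, n):
    """Translate a flat index in the re-mapped (transposed) view of an
    n x n row-major matrix to the flat index of the same element in the
    normal matrix."""
    row, col = divmod(shadow_index, n)
    return col * n + row


def assemble_line(memory, shadow_line, n):
    """Cache line assembly: gather the elements of one re-mapped line
    from their scattered locations in the normal matrix."""
    start = shadow_line * ELEMS_PER_LINE
    return [memory[remap_transpose(start + k, n)]
            for k in range(ELEMS_PER_LINE)]


def disassemble_line(memory, shadow_line, data, n):
    """Cache line disassembly: scatter a written-back re-mapped line to
    the corresponding locations in the normal matrix."""
    start = shadow_line * ELEMS_PER_LINE
    for k, value in enumerate(data):
        memory[remap_transpose(start + k, n)] = value


def merge_reduction_line(dest, private_line, offset):
    """Parallel reduction (assumed model): fold a written-back line of
    per-processor partial sums into the shared destination vector."""
    for k, value in enumerate(private_line):
        dest[offset + k] += value


if __name__ == "__main__":
    n = 16
    matrix = list(range(n * n))          # normal, row-major storage
    # The first re-mapped line holds column 0 of the normal matrix.
    print(assemble_line(matrix, 0, n))
```

In this sketch a processor reading the re-mapped view walks a column of the original matrix with unit stride, which is the cache-friendly access pattern that address re-mapping is meant to provide. In a real system the gather and scatter would run on the protocol thread or core and would also have to keep the re-mapped and normal lines coherent.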


