On-chip communication and synchronization mechanisms with cache-integrated network interfaces

Proceedings of the 7th ACM international conference on Computing frontiers. Pub Date: 2010-05-17. DOI: 10.1145/1787275.1787328

S. Kavadias, M. Katevenis, M. Zampetakis, Dimitrios S. Nikolopoulos


Citations: 35

Abstract

Per-core local (scratchpad) memories allow direct inter-core communication, with latency and energy advantages over coherent cache-based communication, especially as CMP architectures become more distributed. We have designed cache-integrated network interfaces (NIs), appropriate for scalable multicores, that combine the best of two worlds, the flexibility of caches and the efficiency of scratchpad memories: on-chip SRAM is configurably shared among caching, scratchpad, and virtualized NI functions. This paper presents our architecture, which provides local and remote scratchpad access to either individual words or multi-word blocks through RDMA copy. Furthermore, we introduce event responses as a mechanism for software-configurable synchronization primitives. We present three event response mechanisms that expose NI functionality to software: multi-word transfer initiation, memory barriers for explicitly selected accesses of arbitrary size, and multi-party synchronization queues. We implemented these mechanisms in a four-core FPGA prototype, and evaluated on-chip communication performance on the prototype as well as on a CMP simulator with up to 128 cores. We demonstrate efficient synchronization, low-overhead communication, and amortized-overhead bulk transfers, which allow parallelization gains for fine-grain tasks and efficient exploitation of the hardware bandwidth.
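
The abstract mentions multi-word RDMA copies between scratchpads and barriers over explicitly selected accesses. As a rough software illustration only, the sketch below models one plausible reading: a block copy between two per-core memories that acknowledges each block to a counter, which fires a notification once the expected number of words has arrived. This is not the paper's hardware design. The names `Scratchpad`, `CompletionCounter`, and `rdma_copy`, the block size, and the acknowledge-per-block behavior are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Scratchpad:
    """Toy word-addressed per-core local memory (hypothetical model)."""
    words: List[int]


@dataclass
class CompletionCounter:
    """Hypothetical counter that fires `on_zero` once all expected words arrive."""
    remaining: int
    on_zero: Callable[[], None]
    fired: bool = False

    def acknowledge(self, n_words: int) -> None:
        # Count down by the number of words just delivered.
        self.remaining -= n_words
        if self.remaining <= 0 and not self.fired:
            self.fired = True
            self.on_zero()


def rdma_copy(src: Scratchpad, src_off: int,
              dst: Scratchpad, dst_off: int,
              n_words: int, counter: CompletionCounter,
              block: int = 4) -> None:
    """Copy n_words in blocks of up to `block` words, acknowledging each block."""
    for start in range(0, n_words, block):
        n = min(block, n_words - start)
        chunk = src.words[src_off + start: src_off + start + n]
        dst.words[dst_off + start: dst_off + start + n] = chunk
        counter.acknowledge(n)


# Usage: core 0 pushes 10 words into core 1's scratchpad.
core0 = Scratchpad(list(range(16)))
core1 = Scratchpad([0] * 16)
events: List[str] = []
counter = CompletionCounter(remaining=10,
                            on_zero=lambda: events.append("transfer complete"))
rdma_copy(core0, 2, core1, 0, 10, counter)
```

In this toy model the receiver does not poll the data itself. It only observes the single notification recorded in `events`, which is the sense in which a counter can act as a barrier over a chosen set of accesses.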
