A G-Line-Based Network for Fast and Efficient Barrier Synchronization in Many-Core CMPs

2010 39th International Conference on Parallel Processing. Pub Date: 2010-09-13. DOI: 10.1109/ICPP.2010.34

José L. Abellán, Juan Fernández, M. Acacio


Citations: 16

Abstract

Barrier synchronization in shared-memory parallel machines has been widely implemented through busy-waiting on shared variables. However, typical implementations of barrier synchronization tend to produce hot-spots in terms of memory and network contention, thus creating performance bottlenecks that become markedly more pronounced as the number of cores or processors increases. To overcome such limitations, we present a novel hardware-based barrier mechanism in the context of many-core CMPs. Our proposal is based on global interconnection lines (G-lines) and the S-CSMA technique, which have recently been used to enhance a flow control mechanism (EVC) in the context of networks-on-chip. Based on this technology, we have designed a simple and scalable G-line-based network that operates independently of the main data network and is aimed at carrying out barrier synchronizations efficiently. In the ideal case, our design takes only 4 cycles to perform a barrier synchronization once all cores or threads have arrived at the barrier. As a proof of concept, we examine the benefits of our proposal by comparing it with one of the best software approaches (a binary combining-tree barrier). To do so, we run several kernels and scientific applications on top of the Sim-PowerCMP performance simulator, which models a 32-core CMP with a 2D-mesh network configuration. Our proposal entails average reductions in execution time of 68% and 21% for kernels and scientific applications, respectively. Additionally, network traffic is lowered by 74% and 18%, respectively.
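To make the "busy-waiting on shared variables" pattern concrete, the following is a minimal sketch of a centralized sense-reversing software barrier. It is a generic textbook illustration, not the binary combining-tree barrier used as the baseline in the paper and not the G-line hardware mechanism. The names `SenseReversingBarrier` and `run_demo` are hypothetical, and the lock merely stands in for an atomic fetch-and-decrement. Every waiting thread polls the same shared flag, which is the kind of access pattern that can produce the memory and network hot-spots the abstract describes.

```python
import threading
import time


class SenseReversingBarrier:
    """Centralized software barrier: all waiters spin on one shared flag."""

    def __init__(self, num_threads):
        self.num_threads = num_threads
        self.remaining = num_threads    # threads still to arrive in this episode
        self.sense = False              # shared flag polled by every waiter
        self.lock = threading.Lock()    # assumption: stands in for an atomic decrement
        self.local = threading.local()  # holds each thread's private sense

    def wait(self):
        # Each thread flips its private sense once per barrier episode.
        my_sense = not getattr(self.local, "sense", False)
        self.local.sense = my_sense
        with self.lock:
            self.remaining -= 1
            last = self.remaining == 0
            if last:
                # The last arriver resets the counter for the next episode,
                # then releases the waiters by flipping the shared flag.
                self.remaining = self.num_threads
                self.sense = my_sense
        if not last:
            # Busy-wait on the shared variable until the last arriver flips it.
            while self.sense != my_sense:
                time.sleep(0)  # yield so other Python threads can make progress


def run_demo(num_threads=4, phases=3):
    """Run several barrier episodes and record any thread that got through early."""
    barrier = SenseReversingBarrier(num_threads)
    arrived = [0] * phases
    violations = []
    count_lock = threading.Lock()

    def worker():
        for phase in range(phases):
            with count_lock:
                arrived[phase] += 1
            barrier.wait()
            # Past the barrier, every thread must have checked in for this phase.
            if arrived[phase] != num_threads:
                violations.append(phase)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return arrived, violations


if __name__ == "__main__":
    print(run_demo())
```

In this sketch the single `sense` flag and the single `remaining` counter are touched by every thread, so contention on them grows with the thread count. Combining-tree barriers spread those accesses over a tree of counters, whereas the proposal in the abstract moves the synchronization off the data network entirely.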
