RAPID: Memory-Aware NoC for Latency Optimized GPGPU Architectures

IEEE Transactions on Multi-Scale Computing Systems Pub Date : 2018-09-23 DOI:10.1109/TMSCS.2018.2871094

Venkata Yaswanth Raparti;Sudeep Pasricha

Journal: IEEE Transactions on Multi-Scale Computing Systems, vol. 4, no. 4, pp. 874-887. Publisher page: https://ieeexplore.ieee.org/document/8470113/

Citations: 6

Abstract

The growing parallelism in most of today's applications has led to an increased demand for parallel computing in processors. General Purpose Graphics Processing Units (GPGPUs) have been used extensively to support highly parallel applications in recent years. Such GPGPUs generate huge volumes of network traffic between memory controllers (MCs) and shader cores. As a result, the network-on-chip (NoC) fabric can become a performance bottleneck, especially for memory-intensive applications running on GPGPUs. Traditional mesh-based NoC topologies are not suitable for GPGPUs, as their high network latency leads to congestion at MCs and an increase in application execution time. In this article, we propose a novel memory-aware NoC that has two planes (request and reply) tailored to exploit the traffic characteristics in GPGPUs. The request plane consists of low-power, low-latency routers that are optimized for the many-to-few traffic pattern. In the reply plane, flits are sent on fast overlay circuits to reach their destinations in just three cycles (at 1 GHz). In addition, traditional memory controllers are not aware of application memory intensity, which leads to higher waiting times for applications on the shader cores; we therefore propose an enhanced memory controller that prioritizes burst packets to improve application performance on GPGPUs. Experimental results indicate that our framework yields an improvement of 4-10× in NoC latency, up to 63 percent in execution time, and up to 4× in total energy consumption compared to the state of the art.
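To give a feel for the latency figures in the abstract, the following is a minimal back-of-the-envelope sketch, not the authors' model. It estimates the zero-load latency of a plain 2D mesh by averaging the Manhattan hop count between every core tile and every MC tile, and compares that with the fixed three-cycle reply delivery at 1 GHz stated above. The mesh size, the MC placement, the cycles-per-hop value, and all function names are assumptions chosen for illustration only; congestion, serialization, and queuing at the MCs are ignored.

```python
from itertools import product


def mean_core_to_mc_hops(width, height, mc_positions):
    """Average Manhattan hop count over all (core, MC) pairs in a 2D mesh.

    Assumes every core is equally likely to talk to every MC
    (a hypothetical uniform address interleaving).
    """
    mcs = set(mc_positions)
    cores = [t for t in product(range(width), range(height)) if t not in mcs]
    total = sum(abs(x - mx) + abs(y - my) for (x, y) in cores for (mx, my) in mcs)
    return total / (len(cores) * len(mcs))


def cycles_to_ns(cycles, freq_ghz):
    """Convert a cycle count to nanoseconds at the given clock frequency."""
    return cycles / freq_ghz


if __name__ == "__main__":
    # Hypothetical configuration: 6x6 mesh with four MCs on the corners.
    mc_tiles = [(0, 0), (0, 5), (5, 0), (5, 5)]
    hops = mean_core_to_mc_hops(6, 6, mc_tiles)

    cycles_per_hop = 3  # assumed router pipeline plus link traversal
    mesh_ns = cycles_to_ns(hops * cycles_per_hop, freq_ghz=1.0)
    overlay_ns = cycles_to_ns(3, freq_ghz=1.0)  # three cycles at 1 GHz, from the abstract

    print(f"mean hops core<->MC: {hops:.2f}")
    print(f"zero-load mesh estimate: {mesh_ns:.1f} ns")
    print(f"overlay reply delivery: {overlay_ns:.1f} ns")
```

Because the mesh estimate grows with hop count while the overlay figure is fixed, the gap widens as the mesh gets larger, which is consistent with the abstract's argument against mesh topologies for many-to-few traffic.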
