Software/Hardware Co-design of 3D NoC-based GPU Architectures for Accelerated Graph Computations

ACM Transactions on Design Automation of Electronic Systems (TODAES) Pub Date: 2022-04-04 DOI: 10.1145/3514354

Dwaipayan Choudhury, Reet Barik, Aravind Sukumaran Rajam, A. Kalyanaraman, and Partha Pratim Pande


Citations: 2

Abstract

Manycore GPU architectures have become the mainstay for accelerating graph computations. One of the primary performance bottlenecks for graph computations on manycore architectures is data movement. Since most of the accesses in graph processing are due to vertex neighborhood lookups, locality in graph data structures plays a key role in dictating the degree of data movement. Vertex reordering is a widely used technique to improve data locality within graph data structures. However, these reordering schemes alone are not sufficient: they need to be complemented with efficient task allocation on manycore GPU architectures to reduce latency due to local cache misses. Consequently, in this article, we introduce a software/hardware co-design framework for accelerating graph computations. Our approach couples an architecture-aware vertex reordering with a priority-based task allocation technique. As the task allocation aims to reduce on-chip latency and the associated energy, the choice of Network-on-Chip (NoC) as the communication backbone in the manycore platform is an important parameter. By leveraging emerging three-dimensional (3D) integration technology, we propose the design of a small-world NoC (SWNoC)-enabled manycore GPU architecture, where the placement of the links connecting the streaming multiprocessors (SMs) and the memory controllers (MCs) follows a power-law distribution. The proposed 3D SWNoC-enabled software/hardware co-design framework achieves 11.1% to 22.9% performance improvement and 16.4% to 32.6% less energy consumption, depending on the dataset and the graph application, when compared to the default dataset ordering running on a conventional planar mesh architecture.
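To make the locality argument concrete, the following is a minimal sketch of generic vertex reordering, assuming a plain breadth-first (BFS) ordering as a stand-in. It is not the architecture-aware reordering proposed in the article, whose details the abstract does not give. The function names `bfs_reorder`, `relabel`, and `avg_neighbor_gap`, the toy graph, and the use of the mean ID gap between neighbors as a locality proxy are all illustrative assumptions.

```python
from collections import deque


def bfs_reorder(adj):
    """Assign new vertex IDs in BFS visit order (hypothetical helper).

    adj is an adjacency list: adj[u] is the list of neighbors of u.
    Returns new_id, where new_id[u] is the new label of old vertex u.
    """
    n = len(adj)
    new_id = [-1] * n
    next_id = 0
    for root in range(n):  # restart BFS for every unvisited component
        if new_id[root] != -1:
            continue
        new_id[root] = next_id
        next_id += 1
        queue = deque([root])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if new_id[v] == -1:
                    new_id[v] = next_id
                    next_id += 1
                    queue.append(v)
    return new_id


def relabel(adj, new_id):
    """Rebuild the adjacency list under the new vertex labels."""
    out = [[] for _ in adj]
    for u, nbrs in enumerate(adj):
        out[new_id[u]] = sorted(new_id[v] for v in nbrs)
    return out


def avg_neighbor_gap(adj):
    """Mean |u - v| over all directed edges: a crude proxy for locality."""
    total = sum(abs(u - v) for u, nbrs in enumerate(adj) for v in nbrs)
    edges = sum(len(nbrs) for nbrs in adj)
    return total / edges if edges else 0.0


# Toy input (assumption): the path 0-4-1-5-2-6-3-7, i.e. a simple chain
# whose vertex labels are scattered, so neighbors sit far apart in memory.
adj = [[4], [4, 5], [5, 6], [6, 7], [0, 1], [1, 2], [2, 3], [3]]

new_id = bfs_reorder(adj)
reordered = relabel(adj, new_id)

print("gap before:", avg_neighbor_gap(adj))
print("gap after: ", avg_neighbor_gap(reordered))
```

On this toy chain the mean neighbor gap drops from about 3.57 to 1.0, because BFS relabels the path as 0-1-2-...-7. A smaller gap means that neighborhood lookups touch nearby entries of the vertex array. As the abstract notes, such reordering alone is not sufficient and still has to be paired with task allocation that matches the hardware layout.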


Source journal

ACM Transactions on Design Automation of Electronic Systems (TODAES)

Self-citation rate

0.00%

Publication count