Software/Hardware Co-design of 3D NoC-based GPU Architectures for Accelerated Graph Computations

Dwaipayan Choudhury, Reet Barik, Aravind Sukumaran Rajam, A. Kalyanaraman, and Partha Pratim Pande
ACM Transactions on Design Automation of Electronic Systems (TODAES), pp. 1-22. Published 2022-04-04. DOI: 10.1145/3514354 (https://doi.org/10.1145/3514354)
Citations: 2

Abstract

Manycore GPU architectures have become the mainstay for accelerating graph computations. One of the primary performance bottlenecks for graph computations on manycore architectures is data movement. Since most of the accesses in graph processing are due to vertex neighborhood lookups, locality in graph data structures plays a key role in dictating the degree of data movement. Vertex reordering is a widely used technique to improve data locality within graph data structures. However, reordering schemes alone are not sufficient: they need to be complemented with efficient task allocation on manycore GPU architectures to reduce latency due to local cache misses. Consequently, in this article, we introduce a software/hardware co-design framework for accelerating graph computations. Our approach couples an architecture-aware vertex reordering with a priority-based task allocation technique. As the task allocation aims to reduce on-chip latency and the associated energy, the choice of Network-on-Chip (NoC) as the communication backbone in the manycore platform is an important parameter. By leveraging emerging three-dimensional (3D) integration technology, we propose the design of a small-world NoC (SWNoC)-enabled manycore GPU architecture, where the placement of the links connecting the streaming multiprocessors (SMs) and the memory controllers (MCs) follows a power-law distribution. The proposed 3D SWNoC-enabled software/hardware co-design framework achieves 11.1% to 22.9% performance improvement and 16.4% to 32.6% less energy consumption, depending on the dataset and the graph application, when compared to the default ordering of the dataset running on a conventional planar mesh architecture.
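
To make the locality argument concrete, the following is a minimal sketch of generic vertex reordering. It is not the architecture-aware scheme proposed in the article, whose details the abstract does not give. It assumes a plain breadth-first relabeling as the reordering heuristic and uses the mean ID distance between neighbors as a rough stand-in for data locality. The function names (`bfs_order`, `relabel`, `avg_gap`) and the toy graph are hypothetical and chosen only for illustration.

```python
from collections import deque


def bfs_order(adj, start=0):
    """Map old vertex IDs to new IDs in BFS visiting order.

    Assumption: BFS is used here as a generic locality heuristic; it is
    not the reordering scheme from the article.
    """
    n = len(adj)
    new_id = [-1] * n
    next_id = 0
    # Visit the start vertex first, then any vertices BFS did not reach.
    for root in [start] + [v for v in range(n) if v != start]:
        if new_id[root] != -1:
            continue
        new_id[root] = next_id
        next_id += 1
        queue = deque([root])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if new_id[v] == -1:
                    new_id[v] = next_id
                    next_id += 1
                    queue.append(v)
    return new_id


def relabel(adj, new_id):
    """Rebuild the adjacency list under the new vertex numbering."""
    out = [[] for _ in adj]
    for u, nbrs in enumerate(adj):
        out[new_id[u]] = sorted(new_id[v] for v in nbrs)
    return out


def avg_gap(adj):
    """Mean |u - v| over all stored edges: a crude proxy for locality."""
    gaps = [abs(u - v) for u, nbrs in enumerate(adj) for v in nbrs]
    return sum(gaps) / len(gaps) if gaps else 0.0


# Toy input: a 6-vertex path whose labels are scrambled (0-3-1-5-2-4).
adj = [[3], [3, 5], [5, 4], [0, 1], [2], [1, 2]]
new_id = bfs_order(adj)
reordered = relabel(adj, new_id)

print("mean neighbor gap before:", avg_gap(adj))
print("mean neighbor gap after: ", avg_gap(reordered))
```

On this toy path the relabeling places every vertex next to its neighbors in ID space, so neighborhood lookups touch adjacent entries. Real graphs will not reorder this cleanly, and, as the abstract notes, reordering by itself does not determine which SM processes which vertices; that is the role of the task allocation step.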