{"title":"Scalable SIMD-Efficient Graph Processing on GPUs","authors":"Farzad Khorasani, Rajiv Gupta, L. Bhuyan","doi":"10.1109/PACT.2015.15","DOIUrl":null,"url":null,"abstract":"The vast computing power of GPUs makes them an attractive platform for accelerating large scale data parallel computations such as popular graph processing applications. However, the inherent irregularity and large sizes of real-world power law graphs makes effective use of GPUs a major challenge. In this paper we develop techniques that greatly enhance the performance and scalability of vertex-centric graph processing on GPUs. First, we present Warp Segmentation, a novel method that greatly enhances GPU device utilization by dynamically assigning appropriate number of SIMD threads to process a vertex with irregular-sized neighbors while employing compact CSR representation to maximize the graph size that can be kept inside the GPU global memory. Prior works can either maximize graph sizes (VWC uses the CSR representation) or device utilization (e.g., CuSha uses the CW representation, however, CW is roughly 2.5x the size of CSR). Second, we further scale graph processing to make use of multiple GPUs while proposing Vertex Refinement to address the challenge of judiciously using the limited bandwidth available for transferring data between GPUs via the PCIe bus. Vertex refinement employs parallel binary prefix sum to dynamically collect only the updated boundary vertices inside GPUs' outbox buffers for dramatically reducing inter-GPU data transfer volume. Whereas existing multi-GPU techniques (Medusa, TOTEM) perform high degree of wasteful vertex transfers. On a single GPU, our framework delivers average speedups of 1.29x to 2.80x over VWC. When scaled to multiple GPUs, our framework achieves up to 2.71x performance improvement compared to inter-GPU vertex communication schemes used by other multi-GPU techniques (i.e., Medusa, TOTEM).","PeriodicalId":385398,"journal":{"name":"2015 International Conference on Parallel Architecture and Compilation (PACT)","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2015-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"109","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2015 International Conference on Parallel Architecture and Compilation (PACT)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/PACT.2015.15","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 109
Abstract
The vast computing power of GPUs makes them an attractive platform for accelerating large-scale data-parallel computations such as popular graph processing applications. However, the inherent irregularity and large sizes of real-world power-law graphs make effective use of GPUs a major challenge. In this paper we develop techniques that greatly enhance the performance and scalability of vertex-centric graph processing on GPUs. First, we present Warp Segmentation, a novel method that greatly enhances GPU device utilization by dynamically assigning an appropriate number of SIMD threads to process a vertex with an irregularly sized neighbor list, while employing the compact CSR representation to maximize the graph size that can be kept inside the GPU global memory. Prior works maximize either graph size (VWC uses the CSR representation) or device utilization (e.g., CuSha uses the CW representation; however, CW is roughly 2.5x the size of CSR). Second, we further scale graph processing to make use of multiple GPUs and propose Vertex Refinement to address the challenge of judiciously using the limited bandwidth available for transferring data between GPUs via the PCIe bus. Vertex Refinement employs a parallel binary prefix sum to dynamically collect only the updated boundary vertices into the GPUs' outbox buffers, dramatically reducing inter-GPU data transfer volume. In contrast, existing multi-GPU techniques (Medusa, TOTEM) perform a high degree of wasteful vertex transfers. On a single GPU, our framework delivers average speedups of 1.29x to 2.80x over VWC. When scaled to multiple GPUs, our framework achieves up to a 2.71x performance improvement compared to the inter-GPU vertex communication schemes used by other multi-GPU techniques (i.e., Medusa, TOTEM).
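To make the CSR-related claims concrete, below is a minimal sketch of a vertex-centric pass over a CSR graph. The array names and the neighbor-sum operation are illustrative assumptions, not the paper's actual code. A naive one-thread-per-vertex mapping like this underutilizes SIMD lanes on power-law graphs, which is exactly the imbalance Warp Segmentation targets by letting a warp's threads cooperatively cover consecutive vertices' edge lists.

```cuda
#include <cuda_runtime.h>

// CSR layout: vertex v's neighbors occupy nbrs[offs[v] .. offs[v+1]-1],
// so the whole graph needs only |V|+1 offsets and |E| neighbor indices.
__global__ void vertexCentricPass(const int*   offs,
                                  const int*   nbrs,
                                  const float* srcVal,
                                  float*       dstVal,
                                  int          numVertices)
{
    int v = blockIdx.x * blockDim.x + threadIdx.x;
    if (v >= numVertices) return;

    // Example user function: gather the sum of neighbor values (a
    // PageRank-style computation). Degree skew makes this loop length
    // vary wildly across threads in the same warp, serializing the warp.
    float acc = 0.0f;
    for (int e = offs[v]; e < offs[v + 1]; ++e)
        acc += srcVal[nbrs[e]];
    dstVal[v] = acc;
}
```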
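The "parallel binary prefix sum" used by Vertex Refinement can be sketched as warp-level stream compaction: each thread checks whether its boundary vertex was updated this iteration and, if so, writes its index and value into the outbox buffer at a position computed from a ballot and a popcount. The sketch below is an assumption about one plausible realization; all names (updatedFlag, outboxIds, etc.) are hypothetical.

```cuda
#include <cuda_runtime.h>

__global__ void refineBoundaryVertices(const int*   boundaryIds, // boundary vertex indices
                                       const float* values,      // current vertex values
                                       const char*  updatedFlag, // 1 if changed this iteration
                                       int          numBoundary,
                                       int*   outboxIds,         // compacted vertex indices
                                       float* outboxVals,        // compacted vertex values
                                       int*   outboxCount)       // global output counter
{
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    int lane = threadIdx.x & 31;

    bool active = (tid < numBoundary) && updatedFlag[tid];

    // Binary prefix sum within the warp: a ballot yields one bit per
    // lane; the popcount of the lower lanes' bits is this lane's offset.
    unsigned ballot = __ballot_sync(0xFFFFFFFF, active);
    int warpOffset  = __popc(ballot & ((1u << lane) - 1));
    int warpTotal   = __popc(ballot);

    // One atomic per warp (instead of one per thread) reserves a
    // contiguous slice of the outbox for all updated lanes.
    int base = 0;
    if (lane == 0 && warpTotal > 0)
        base = atomicAdd(outboxCount, warpTotal);
    base = __shfl_sync(0xFFFFFFFF, base, 0);

    if (active) {
        int pos = base + warpOffset;
        outboxIds[pos]  = boundaryIds[tid];
        outboxVals[pos] = values[tid];
    }
}
```

Only the compacted prefix of the outbox buffer (outboxCount entries) then needs to cross the PCIe bus, rather than the full boundary set, which is the source of the transfer-volume reduction the abstract describes.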