On Analyzing Large Graphs Using GPUs

2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum Pub Date : 2013-05-20 DOI:10.1109/IPDPSW.2013.235

A. Chatterjee, S. Radhakrishnan, J. Antonio

{"title":"On Analyzing Large Graphs Using GPUs","authors":"A. Chatterjee, S. Radhakrishnan, J. Antonio","doi":"10.1109/IPDPSW.2013.235","DOIUrl":null,"url":null,"abstract":"Studying properties of graphs is essential to various applications, and recent growth of online social networks has spurred interests in analyzing their structures using Graphical Processing Units (GPUs). Utilizing the faster available shared memory on GPUs have provided tremendous speed-up for solving many general-purpose problems. However, when data required for processing is large and needs to be stored in the global memory instead of the shared memory, simultaneous memory accesses by threads in execution becomes the bottleneck for achieving higher throughput. In this paper, for storing large graphs, we propose and evaluate techniques to efficiently utilize the different levels of the memory hierarchy of GPUs, with the focus being on the larger global memory. Given a graph G = (V, E), we provide an algorithm to count the number of triangles in G, while storing the adjacency information on the global memory. Our computation techniques and data structure for retrieving the adjacency information is derived from processing the breadth-first-search tree of the input graph. Also, techniques to generate combinations of nodes for testing the properties of graphs induced by the same are discussed in detail. Our methods can be extended to solve other combinatorial counting problems on graphs, such as finding the number of connected sub graphs of size k, number of cliques (resp. independent sets) of size k, and related problems for large data sets. In the context of the triangle counting algorithm, we analyze and utilize primitives such as memory access coalescing and avoiding partition camping that offset the increase in access latency of using a slower but larger global memory. Our experimental results for the GPU implementation show at least 10 times speedup for triangle counting over the CPU counterpart. Another 6 - 8% increase in performance is obtained by utilizing the above mentioned primitives as compared to the naïve implementation of the program on the GPU.","PeriodicalId":234552,"journal":{"name":"2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum","volume":"73 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2013-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"14","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/IPDPSW.2013.235","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 14

Abstract

Studying properties of graphs is essential to various applications, and recent growth of online social networks has spurred interests in analyzing their structures using Graphical Processing Units (GPUs). Utilizing the faster available shared memory on GPUs have provided tremendous speed-up for solving many general-purpose problems. However, when data required for processing is large and needs to be stored in the global memory instead of the shared memory, simultaneous memory accesses by threads in execution becomes the bottleneck for achieving higher throughput. In this paper, for storing large graphs, we propose and evaluate techniques to efficiently utilize the different levels of the memory hierarchy of GPUs, with the focus being on the larger global memory. Given a graph G = (V, E), we provide an algorithm to count the number of triangles in G, while storing the adjacency information on the global memory. Our computation techniques and data structure for retrieving the adjacency information is derived from processing the breadth-first-search tree of the input graph. Also, techniques to generate combinations of nodes for testing the properties of graphs induced by the same are discussed in detail. Our methods can be extended to solve other combinatorial counting problems on graphs, such as finding the number of connected sub graphs of size k, number of cliques (resp. independent sets) of size k, and related problems for large data sets. In the context of the triangle counting algorithm, we analyze and utilize primitives such as memory access coalescing and avoiding partition camping that offset the increase in access latency of using a slower but larger global memory. Our experimental results for the GPU implementation show at least 10 times speedup for triangle counting over the CPU counterpart. Another 6 - 8% increase in performance is obtained by utilizing the above mentioned primitives as compared to the naïve implementation of the program on the GPU.

查看原文本刊更多论文

用gpu分析大图形

研究图形的性质对各种应用程序都是必不可少的，最近在线社交网络的增长激发了人们对使用图形处理单元(gpu)分析其结构的兴趣。利用gpu上更快的可用共享内存为解决许多通用问题提供了巨大的速度提升。但是，当处理所需的数据非常大并且需要存储在全局内存而不是共享内存中时，执行中的线程同时访问内存就成为实现更高吞吐量的瓶颈。在本文中，为了存储大型图形，我们提出并评估了有效利用gpu内存层次的不同级别的技术，重点是更大的全局内存。给定图G = (V, E)，我们提供了一种算法来计算G中三角形的数量，同时将邻接信息存储在全局内存中。我们用于检索邻接信息的计算技术和数据结构来源于对输入图的宽度优先搜索树的处理。此外，还详细讨论了生成节点组合的技术，以测试由节点组合引起的图的属性。我们的方法可以扩展到解决图上的其他组合计数问题，例如找出大小为k的连接子图的数量，团的数量(对应的数目)。大小为k的独立集，以及大型数据集的相关问题。在三角形计数算法的上下文中，我们分析并利用了诸如内存访问合并和避免分区露营之类的原语，这些原语抵消了使用更慢但更大的全局内存所增加的访问延迟。我们对GPU实现的实验结果表明，三角形计数的速度比CPU的速度至少提高了10倍。与naïve在GPU上实现该程序相比，利用上述原语可以获得另外6 - 8%的性能提升。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum

自引率

0.00%

发文量