NUMA聚类中广度优先搜索的评价与优化

2012 IEEE International Conference on Cluster Computing Pub Date : 2012-09-24 DOI:10.1109/CLUSTER.2012.29

Zehan Cui, Licheng Chen, Mingyu Chen, Yungang Bao, Yongbing Huang, Huiwei Lv

{"title":"NUMA聚类中广度优先搜索的评价与优化","authors":"Zehan Cui, Licheng Chen, Mingyu Chen, Yungang Bao, Yongbing Huang, Huiwei Lv","doi":"10.1109/CLUSTER.2012.29","DOIUrl":null,"url":null,"abstract":"Graph is widely used in many areas. Breadth-First Search (BFS), a key subroutine for many graph analysis algorithms, has become the primary benchmark for Graph500 ranking. Due to the high communication cost of BFS, multi-socket nodes with large memory capacity (NUMA) are supposed to reduce network pressure. However, the longer latency to remote memory may cause problem if not treated well. In this work, we first demonstrate that simply spawning and binding one MPI process for each socket can achieve the best performance for MPI/OpenMP hybrid programmed BFS algorithm, resulting in 1.53X of performance on 16 nodes. Nevertheless, we notice that one MPI process per socket may exacerbate the communication cost. We propose to share some communication data structure among the processes inside the same node, to eliminate most of the intra-node communication. To fully utilize the network bandwidth, we make all the processes in a node to perform communication simultaneously. We further adjust the granularity of a key bitmap for better cache locality to speed up the computation. With all the optimizations for NUMA, communication and computation together, 2.44X of performance is achieved on 16 nodes, which is 39.2 Billion Traversed Edges per Second for an R-MAT graph of scale 32 (4 billion vertices and 64 billion edges).","PeriodicalId":143579,"journal":{"name":"2012 IEEE International Conference on Cluster Computing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2012-09-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"9","resultStr":"{\"title\":\"Evaluation and Optimization of Breadth-First Search on NUMA Cluster\",\"authors\":\"Zehan Cui, Licheng Chen, Mingyu Chen, Yungang Bao, Yongbing Huang, Huiwei Lv\",\"doi\":\"10.1109/CLUSTER.2012.29\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Graph is widely used in many areas. Breadth-First Search (BFS), a key subroutine for many graph analysis algorithms, has become the primary benchmark for Graph500 ranking. Due to the high communication cost of BFS, multi-socket nodes with large memory capacity (NUMA) are supposed to reduce network pressure. However, the longer latency to remote memory may cause problem if not treated well. In this work, we first demonstrate that simply spawning and binding one MPI process for each socket can achieve the best performance for MPI/OpenMP hybrid programmed BFS algorithm, resulting in 1.53X of performance on 16 nodes. Nevertheless, we notice that one MPI process per socket may exacerbate the communication cost. We propose to share some communication data structure among the processes inside the same node, to eliminate most of the intra-node communication. To fully utilize the network bandwidth, we make all the processes in a node to perform communication simultaneously. We further adjust the granularity of a key bitmap for better cache locality to speed up the computation. With all the optimizations for NUMA, communication and computation together, 2.44X of performance is achieved on 16 nodes, which is 39.2 Billion Traversed Edges per Second for an R-MAT graph of scale 32 (4 billion vertices and 64 billion edges).\",\"PeriodicalId\":143579,\"journal\":{\"name\":\"2012 IEEE International Conference on Cluster Computing\",\"volume\":\"1 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2012-09-24\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"9\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2012 IEEE International Conference on Cluster Computing\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/CLUSTER.2012.29\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2012 IEEE International Conference on Cluster Computing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CLUSTER.2012.29","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 9

摘要

图形在许多领域都有广泛的应用。宽度优先搜索(BFS)是许多图分析算法的关键子程序，已成为Graph500排名的主要基准。由于BFS的通信成本较高，因此需要采用大内存容量(NUMA)的多套接字节点来减轻网络压力。但是，如果处理不当，到远程内存的较长延迟可能会导致问题。在这项工作中，我们首先证明了简单地为每个套接字生成和绑定一个MPI进程可以实现MPI/OpenMP混合编程BFS算法的最佳性能，从而在16个节点上获得1.53倍的性能。然而，我们注意到每个套接字一个MPI进程可能会增加通信成本。我们建议在同一节点内的进程之间共享一些通信数据结构，以消除节点内的大部分通信。为了充分利用网络带宽，我们使一个节点中的所有进程同时进行通信。我们进一步调整关键位图的粒度，以获得更好的缓存局部性，从而加快计算速度。通过对NUMA、通信和计算的所有优化，在16个节点上实现了2.44倍的性能，即对于规模为32(40亿个顶点和640亿个边)的R-MAT图，每秒392亿遍历边。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Evaluation and Optimization of Breadth-First Search on NUMA Cluster

Graph is widely used in many areas. Breadth-First Search (BFS), a key subroutine for many graph analysis algorithms, has become the primary benchmark for Graph500 ranking. Due to the high communication cost of BFS, multi-socket nodes with large memory capacity (NUMA) are supposed to reduce network pressure. However, the longer latency to remote memory may cause problem if not treated well. In this work, we first demonstrate that simply spawning and binding one MPI process for each socket can achieve the best performance for MPI/OpenMP hybrid programmed BFS algorithm, resulting in 1.53X of performance on 16 nodes. Nevertheless, we notice that one MPI process per socket may exacerbate the communication cost. We propose to share some communication data structure among the processes inside the same node, to eliminate most of the intra-node communication. To fully utilize the network bandwidth, we make all the processes in a node to perform communication simultaneously. We further adjust the granularity of a key bitmap for better cache locality to speed up the computation. With all the optimizations for NUMA, communication and computation together, 2.44X of performance is achieved on 16 nodes, which is 39.2 Billion Traversed Edges per Second for an R-MAT graph of scale 32 (4 billion vertices and 64 billion edges).

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2012 IEEE International Conference on Cluster Computing

自引率

0.00%

发文量