GraphP: Reducing Communication for PIM-Based Graph Processing with Efficient Data Partition

2018 IEEE International Symposium on High Performance Computer Architecture (HPCA). Pub Date: 2018-02-01. DOI: 10.1109/HPCA.2018.00053

Mingxing Zhang, Youwei Zhuo, Chao Wang, Mingyu Gao, Yongwei Wu, Kang Chen, C. Kozyrakis, Xuehai Qian


Citations: 163

Abstract

Processing-In-Memory (PIM) is an effective technique that reduces data movement by integrating processing units within memory. Recent advances in “big data” and 3D stacking technology make PIM a practical and viable solution for modern data processing workloads, as exemplified by the recent research interest in PIM-based acceleration. Among these efforts, TESSERACT is a PIM-enabled parallel graph processing architecture based on Micron’s Hybrid Memory Cube (HMC), one of the most prominent 3D-stacked memory technologies. It implements a Pregel-like vertex-centric programming model, so that users can develop programs in a familiar interface while taking advantage of PIM. Despite its orders-of-magnitude speedup over DRAM-based systems, TESSERACT generates excessive cross-cube communication through SerDes links, whose bandwidth is much lower than the aggregate local bandwidth of the HMCs. Our investigation indicates that this is because of the restricted data organization required by the vertex programming model. In this paper, we argue that a PIM-based graph processing system should take data organization as a first-order design consideration. Following this principle, we propose GraphP, a novel HMC-based software/hardware co-designed graph processing system that drastically reduces communication and energy consumption compared to TESSERACT. GraphP features three key techniques. 1) “Source-cut” partitioning, which fundamentally changes the cross-cube communication from one remote put per cross-cube edge to one update per replica. 2) “Two-phase Vertex Program”, a programming model designed for the “source-cut” partitioning with two operations: GenUpdate and ApplyUpdate. 3) Hierarchical communication and overlapping, which further improve performance by exploiting unique opportunities offered by the proposed partitioning and programming model. We evaluate GraphP using a cycle-accurate simulator with 5 real-world graphs and 4 algorithms. The results show that it provides on average a 1.7x speedup and 89% energy saving compared to TESSERACT.
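
The shift from "one remote put per cross-cube edge" to "one update per replica" can be illustrated with a small counting sketch. This is not the paper's implementation. It assumes, purely for illustration, that each edge is stored in the cube of its destination vertex and that a source vertex gets exactly one replica in every remote cube that holds at least one of its outgoing edges. The function name `count_messages`, the `cube_of` mapping, and the toy graph are all hypothetical.

```python
def count_messages(edges, cube_of):
    """Count per-iteration cross-cube messages under two schemes.

    edges   : iterable of (src, dst) vertex pairs
    cube_of : dict mapping each vertex to the cube holding its master copy

    Returns (puts_per_edge, updates_per_replica):
      puts_per_edge       - one message for every edge whose endpoints
                            live in different cubes
      updates_per_replica - one message for every distinct
                            (source vertex, remote cube) pair, i.e. one
                            per replica under the assumed placement rule
    """
    cross_edges = 0
    replicas = set()
    for src, dst in edges:
        if cube_of[src] != cube_of[dst]:
            cross_edges += 1
            # Assumption: the edge lives with its destination, so the
            # source needs a single replica in the destination's cube.
            replicas.add((src, cube_of[dst]))
    return cross_edges, len(replicas)


# Toy graph: vertices 0 and 1 in cube 0, vertices 2 and 3 in cube 1.
edges = [(0, 2), (0, 3), (1, 2), (1, 3), (2, 0)]
cube_of = {0: 0, 1: 0, 2: 1, 3: 1}

per_edge, per_replica = count_messages(edges, cube_of)
print(per_edge, per_replica)
```

In this toy graph all five edges cross cubes, but only three replicas are needed, so the per-replica scheme sends fewer messages whenever a vertex has several edges into the same remote cube.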
