HuGraph: Acceleration of GCN Training on Heterogeneous FPGA Clusters with Quantization

2022 IEEE High Performance Extreme Computing Conference (HPEC) | Pub Date: 2022-09-19 | DOI: 10.1109/HPEC55821.2022.9926312

Letian Zhao, Qizhe Wu, Xiaotian Wang, Teng Tian, Wei Wu, Xi Jin


Citations: 1

Abstract


Graph convolutional networks (GCNs) have succeeded significantly in numerous fields, but the need for higher performance and energy efficiency when training GCNs on larger graphs continues unabated. At the same time, because reconfigurable accelerators allow fine-grained customization of computing modules and data movement, FPGAs can address problems such as irregular memory access in GCN computation. Furthermore, to scale GCN computation, the use of heterogeneous FPGAs is inevitable due to the constant iteration of new FPGAs. In this paper, we propose a novel framework, HuGraph, which automatically maps GCN training onto heterogeneous FPGA clusters. With HuGraph, FPGAs work in synchronous data parallelism using a simple 1D ring topology that is suitable for most off-the-shelf FPGA clusters. HuGraph uses three approaches to advance performance and energy efficiency. First, HuGraph applies full-process quantization to neighbor-sampling-based data-parallel training, thereby reducing computation and memory consumption. Second, a novel balanced sampler balances workloads among heterogeneous FPGAs so that FPGAs with fewer resources do not become bottlenecks in the cluster. Third, HuGraph schedules the execution order of GCN training to minimize time overhead. We implement a prototype on a single FPGA and evaluate cluster-level performance with a cycle-accurate simulator. Experiments show that HuGraph achieves up to 102.3×, 4.62×, and 11.1× speedup compared with state-of-the-art works on CPU, GPU, and FPGA platforms, respectively, with negligible accuracy loss.
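The abstract does not say how the balanced sampler works internally. As a loose illustration of the general idea behind it (in synchronous data parallelism the slowest device sets the step time, so devices with fewer resources should receive less work), here is a minimal Python sketch. The function names `split_batch` and `step_time`, the relative-throughput figures, and the proportional-split rule are all assumptions made for illustration; this is not HuGraph's algorithm.

```python
def split_batch(batch_size, throughputs):
    """Split batch_size work items across devices in proportion to throughput.

    Hypothetical helper: `throughputs` holds assumed relative processing
    rates, one per device. Largest-remainder rounding makes the integer
    shares sum exactly to batch_size.
    """
    if batch_size < 0 or not throughputs or any(t <= 0 for t in throughputs):
        raise ValueError("need batch_size >= 0 and positive throughputs")
    total = sum(throughputs)
    exact = [batch_size * t / total for t in throughputs]
    shares = [int(x) for x in exact]  # floor, since every x is non-negative
    leftover = batch_size - sum(shares)
    # Give the remaining items to the devices with the largest fractional parts.
    order = sorted(range(len(exact)),
                   key=lambda i: exact[i] - shares[i],
                   reverse=True)
    for i in order[:leftover]:
        shares[i] += 1
    return shares


def step_time(shares, throughputs):
    """Time of one synchronous step: every device waits for the slowest one."""
    return max(s / t for s, t in zip(shares, throughputs))
```

With assumed relative throughputs of `[4.0, 2.0, 1.0, 1.0]` and 1024 items per step, an even split of 256 items per device gives a step time of 256 units, because the two slowest devices hold everyone back. The proportional split `[512, 256, 128, 128]` lets all four devices finish in 128 units.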

