Haishuang Fan;Rui Meng;Qichu Sun;Jingya Wu;Wenyan Lu;Xiaowei Li;Guihai Yan
DOI: 10.1109/TCAD.2025.3555192
Journal: IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, volume 44 (10), pages 3816-3829
Published: 2025-03-26
Publication type: Journal Article
Impact factor: 2.9 (JCR Q2, Computer Science, Hardware & Architecture)
URL: https://ieeexplore.ieee.org/document/10939011/
GRACE: An End-to-End Graph Processing Accelerator on FPGA With Graph Reordering Engine
Graphs play an important role in various applications. As real-world graphs grow rapidly in vertex count, existing large-scale graph processing frameworks on CPUs and GPUs struggle to use caches effectively because of irregular memory access patterns. Graph reordering has been proposed to improve graph locality, but it introduces significant overhead without delivering substantial end-to-end performance improvement. While there have been many FPGA-based accelerators for graph processing, achieving high throughput often requires complex graph preprocessing on CPUs. Therefore, implementing an efficient end-to-end graph processing system remains challenging. This article introduces GRACE, an end-to-end FPGA-based graph processing accelerator with a graph reordering engine and a pull-based vertex-centric programming model (PL-VCPM) engine. First, GRACE employs a customized high-degree vertex cache (HDC) to improve memory access efficiency. Second, GRACE offloads graph preprocessing to the FPGA, using a customized, efficient graph reordering engine to complete it. Third, GRACE adopts a graph pruning strategy to remove activation and computation redundancy in graph processing. Finally, GRACE introduces a graph conflict board (GCB) to resolve data conflicts and a multiport cache to enhance parallel efficiency. Experimental results demonstrate that GRACE achieves a $7.1\times$ end-to-end performance speedup over CPU and $1.8\times$ over GPU, as well as $27.3\times$ and $8.7\times$ higher energy efficiency than CPU and GPU, respectively. Moreover, GRACE delivers up to a $34.9\times$ performance speedup compared to the state-of-the-art FPGA accelerator.
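To make the two central ideas in the abstract concrete, the following is a minimal software sketch, not GRACE's hardware design. It assumes a simple degree-based reordering (the abstract does not say which reordering algorithm the engine implements) and uses breadth-first search as the example pull-based vertex-centric workload. The function names `degree_reorder` and `pull_bfs` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def degree_reorder(num_vertices, edges):
    """Relabel vertices so higher in-degree vertices get smaller IDs.

    Assumption: degree-based ordering stands in for the paper's
    reordering engine, whose exact algorithm is not given in the abstract.
    """
    in_deg = [0] * num_vertices
    for _, dst in edges:
        in_deg[dst] += 1
    # sorted() is stable, so vertices with equal degree keep their order.
    order = sorted(range(num_vertices), key=lambda v: -in_deg[v])
    new_id = [0] * num_vertices
    for rank, old in enumerate(order):
        new_id[old] = rank
    return new_id, [(new_id[s], new_id[d]) for s, d in edges]


def pull_bfs(num_vertices, edges, source):
    """Pull-based BFS: each unvisited vertex reads its in-neighbours."""
    in_nbrs = defaultdict(list)
    for s, d in edges:
        in_nbrs[d].append(s)
    level = [-1] * num_vertices
    level[source] = 0
    depth = 0
    changed = True
    while changed:
        changed = False
        for v in range(num_vertices):
            if level[v] != -1:
                continue  # already settled: skip redundant work
            # Pull: look for an in-neighbour on the current frontier.
            if any(level[u] == depth for u in in_nbrs[v]):
                level[v] = depth + 1
                changed = True
        depth += 1
    return level


edges = [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4)]
new_id, new_edges = degree_reorder(5, edges)
levels = pull_bfs(5, new_edges, new_id[0])
# Map results back to the original vertex numbering.
original_levels = [levels[new_id[v]] for v in range(5)]
```

After reordering, the most frequently read vertices occupy a contiguous low-ID range, which is the property a high-degree vertex cache could exploit; skipping already-settled vertices is a simple stand-in for the kind of redundancy that a pruning strategy targets.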
About the journal:
The purpose of this Transactions is to publish papers of interest to individuals in the area of computer-aided design of integrated circuits and systems composed of analog, digital, mixed-signal, optical, or microwave components. The aids include methods, models, algorithms, and man-machine interfaces for system-level, physical and logical design including: planning, synthesis, partitioning, modeling, simulation, layout, verification, testing, hardware-software co-design and documentation of integrated circuit and system designs of all complexities. Design tools and techniques for evaluating and designing integrated circuits and systems for metrics such as performance, power, reliability, testability, and security are a focus.