HashGrid：在 FPGA 上加速图计算的优化架构

IF 6.2 2区计算机科学 Q1 COMPUTER SCIENCE, THEORY & METHODS

Future Generation Computer Systems-The International Journal of Escience Pub Date : 2024-08-28 DOI:10.1016/j.future.2024.107497

Amin Sahebi , Marco Procaccini , Roberto Giorgi

{"title":"HashGrid：在 FPGA 上加速图计算的优化架构","authors":"Amin Sahebi , Marco Procaccini , Roberto Giorgi","doi":"10.1016/j.future.2024.107497","DOIUrl":null,"url":null,"abstract":"<div><p>Large-scale graph processing poses challenges due to its size and irregular memory access patterns, causing performance degradation in common architectures, such as CPUs and GPUs. Recent research includes accelerating graph processing using Field Programmable Gate Arrays (FPGAs). FPGAs can provide very efficient acceleration thanks to reconfigurable on-chip resources. Although limited, these resources offer a larger design space than CPUs and GPUs.</p><p>We propose an approach in which data are preprocessed in small chunks with an optimized graph partitioning technique for execution on FPGA accelerators. The chunks, located on the host, are streamed directly into a customized memory layer implemented in the FPGA, which is tightly coupled with the processing elements responsible for the graph algorithm execution. This improves application memory access latency, which is crucial in large-sale graph computing performance.</p><p>This work presents a hardware design that, combined with graph partitioning, enables us to achieve high-performance and potentially scalable handling of large graphs (i.e., graphs with millions of vertices and billions of edges in current scenarios) while using popular graph algorithms. The proposed framework accelerates performance 56 times compared with CPU (multicore with 16 logical cores in our reference experiments), 2.5 times and 4 times faster compared to state-of-the-art FPGA and GPU solutions (FPGA has 15 compute units, and GPU reference has 128 streaming-multiprocessors in our experiments), respectively, when using the PageRank algorithm. For the Single-Source-Shortest-Past (SSSP) algorithm, we achieve speedups of up to 65x, 26x, and 18x compared to CPU, GPU, and FPGA works, respectively. Lastly, in the context of the Weakly Connected Component (WCC) algorithm, our framework achieves a speedup of up to 403 times compared to the CPU, 7.4x against the GPU, and it is faster than the FPGA alternatives up to 10.3x.</p></div>","PeriodicalId":55132,"journal":{"name":"Future Generation Computer Systems-The International Journal of Escience","volume":"162 ","pages":"Article 107497"},"PeriodicalIF":6.2000,"publicationDate":"2024-08-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0167739X24004618/pdfft?md5=7e32540f7a3e9f063a049f49611d08e9&pid=1-s2.0-S0167739X24004618-main.pdf","citationCount":"0","resultStr":"{\"title\":\"HashGrid: An optimized architecture for accelerating graph computing on FPGAs\",\"authors\":\"Amin Sahebi , Marco Procaccini , Roberto Giorgi\",\"doi\":\"10.1016/j.future.2024.107497\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><p>Large-scale graph processing poses challenges due to its size and irregular memory access patterns, causing performance degradation in common architectures, such as CPUs and GPUs. Recent research includes accelerating graph processing using Field Programmable Gate Arrays (FPGAs). FPGAs can provide very efficient acceleration thanks to reconfigurable on-chip resources. Although limited, these resources offer a larger design space than CPUs and GPUs.</p><p>We propose an approach in which data are preprocessed in small chunks with an optimized graph partitioning technique for execution on FPGA accelerators. The chunks, located on the host, are streamed directly into a customized memory layer implemented in the FPGA, which is tightly coupled with the processing elements responsible for the graph algorithm execution. This improves application memory access latency, which is crucial in large-sale graph computing performance.</p><p>This work presents a hardware design that, combined with graph partitioning, enables us to achieve high-performance and potentially scalable handling of large graphs (i.e., graphs with millions of vertices and billions of edges in current scenarios) while using popular graph algorithms. The proposed framework accelerates performance 56 times compared with CPU (multicore with 16 logical cores in our reference experiments), 2.5 times and 4 times faster compared to state-of-the-art FPGA and GPU solutions (FPGA has 15 compute units, and GPU reference has 128 streaming-multiprocessors in our experiments), respectively, when using the PageRank algorithm. For the Single-Source-Shortest-Past (SSSP) algorithm, we achieve speedups of up to 65x, 26x, and 18x compared to CPU, GPU, and FPGA works, respectively. Lastly, in the context of the Weakly Connected Component (WCC) algorithm, our framework achieves a speedup of up to 403 times compared to the CPU, 7.4x against the GPU, and it is faster than the FPGA alternatives up to 10.3x.</p></div>\",\"PeriodicalId\":55132,\"journal\":{\"name\":\"Future Generation Computer Systems-The International Journal of Escience\",\"volume\":\"162 \",\"pages\":\"Article 107497\"},\"PeriodicalIF\":6.2000,\"publicationDate\":\"2024-08-28\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.sciencedirect.com/science/article/pii/S0167739X24004618/pdfft?md5=7e32540f7a3e9f063a049f49611d08e9&pid=1-s2.0-S0167739X24004618-main.pdf\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Future Generation Computer Systems-The International Journal of Escience\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0167739X24004618\",\"RegionNum\":2,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, THEORY & METHODS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Future Generation Computer Systems-The International Journal of Escience","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0167739X24004618","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, THEORY & METHODS","Score":null,"Total":0}

引用次数: 0

摘要

大规模图形处理因其规模和不规则的内存访问模式而面临挑战，导致 CPU 和 GPU 等常见架构的性能下降。最近的研究包括利用现场可编程门阵列（FPGA）加速图处理。FPGA 具有可重新配置的片上资源，因此可以提供非常高效的加速。我们提出了一种方法，利用优化的图形分割技术，将数据分成小块进行预处理，以便在 FPGA 加速器上执行。位于主机上的数据块直接流向 FPGA 中的定制内存层，该内存层与负责图形算法执行的处理元件紧密耦合。这项工作提出了一种硬件设计，它与图分区相结合，使我们能够在使用流行图算法的同时，实现对大型图（即当前场景中具有数百万顶点和数十亿条边的图）的高性能和潜在可扩展处理。在使用 PageRank 算法时，与最先进的 FPGA 和 GPU 解决方案（在我们的实验中，FPGA 有 15 个计算单元，GPU 参考有 128 个流式多核处理器）相比，所提出的框架将性能分别提高了 56 倍、2.5 倍和 4 倍。在单源最短过去算法（SSSP）方面，与 CPU、GPU 和 FPGA 相比，我们的速度分别提高了 65 倍、26 倍和 18 倍。最后，在弱连接成分（WCC）算法方面，我们的框架与 CPU 相比提速达 403 倍，与 GPU 相比提速达 7.4 倍，与 FPGA 相比提速达 10.3 倍。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

HashGrid: An optimized architecture for accelerating graph computing on FPGAs

Large-scale graph processing poses challenges due to its size and irregular memory access patterns, causing performance degradation in common architectures, such as CPUs and GPUs. Recent research includes accelerating graph processing using Field Programmable Gate Arrays (FPGAs). FPGAs can provide very efficient acceleration thanks to reconfigurable on-chip resources. Although limited, these resources offer a larger design space than CPUs and GPUs.

We propose an approach in which data are preprocessed in small chunks with an optimized graph partitioning technique for execution on FPGA accelerators. The chunks, located on the host, are streamed directly into a customized memory layer implemented in the FPGA, which is tightly coupled with the processing elements responsible for the graph algorithm execution. This improves application memory access latency, which is crucial in large-sale graph computing performance.

This work presents a hardware design that, combined with graph partitioning, enables us to achieve high-performance and potentially scalable handling of large graphs (i.e., graphs with millions of vertices and billions of edges in current scenarios) while using popular graph algorithms. The proposed framework accelerates performance 56 times compared with CPU (multicore with 16 logical cores in our reference experiments), 2.5 times and 4 times faster compared to state-of-the-art FPGA and GPU solutions (FPGA has 15 compute units, and GPU reference has 128 streaming-multiprocessors in our experiments), respectively, when using the PageRank algorithm. For the Single-Source-Shortest-Past (SSSP) algorithm, we achieve speedups of up to 65x, 26x, and 18x compared to CPU, GPU, and FPGA works, respectively. Lastly, in the context of the Weakly Connected Component (WCC) algorithm, our framework achieves a speedup of up to 403 times compared to the CPU, 7.4x against the GPU, and it is faster than the FPGA alternatives up to 10.3x.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Future Generation Computer Systems-The International Journal of Escience 工程技术-计算机：理论方法

CiteScore

19.90

自引率

2.70%

发文量

376

审稿时长

10.6 months

期刊介绍： Computing infrastructures and systems are constantly evolving, resulting in increasingly complex and collaborative scientific applications. To cope with these advancements, there is a growing need for collaborative tools that can effectively map, control, and execute these applications. Furthermore, with the explosion of Big Data, there is a requirement for innovative methods and infrastructures to collect, analyze, and derive meaningful insights from the vast amount of data generated. This necessitates the integration of computational and storage capabilities, databases, sensors, and human collaboration. Future Generation Computer Systems aims to pioneer advancements in distributed systems, collaborative environments, high-performance computing, and Big Data analytics. It strives to stay at the forefront of developments in grids, clouds, and the Internet of Things (IoT) to effectively address the challenges posed by these wide-area, fully distributed sensing and computing systems.