ALBBA: An efficient ALgebraic Bypass BFS Algorithm on long vector architectures

Impact factor 2.1; region 4 (Computer Science); JCR Q2 (COMPUTER SCIENCE, THEORY & METHODS)

Parallel Computing Pub Date : 2025-07-11 DOI:10.1016/j.parco.2025.103147

Yuyao Niu, Marc Cacas

Parallel Computing, Volume 125, Article 103147. Source: https://www.sciencedirect.com/science/article/pii/S0167819125000237

Citations: 0

Abstract



ALBBA: An efficient ALgebraic Bypass BFS Algorithm on long vector architectures

Breadth First Search (BFS) is a fundamental algorithm in scientific computing, databases, and network analysis applications. In the algebraic BFS paradigm, each BFS iteration is expressed as a sparse matrix–vector multiplication, allowing BFS to be accelerated and analyzed through well-established linear algebra primitives. Although much effort has been made to optimize algebraic BFS on parallel platforms such as CPUs, GPUs, and distributed memory systems, vector architectures that exploit Single Instruction Multiple Data (SIMD) parallelism, particularly with their high performance on sparse workloads, remain relatively underexplored for BFS.
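To make the algebraic formulation concrete, here is a minimal sketch of level-synchronous BFS written as repeated Boolean matrix-vector products. It is an illustration under stated assumptions, not the ALBBA implementation: the function name `bfs_levels`, the edge-list input, and the CSR layout of the transposed adjacency matrix are all choices made for this example. Each iteration computes `next = (A^T · frontier)` over the (OR, AND) semiring and masks out vertices that were already visited.

```python
from typing import List, Tuple


def bfs_levels(n: int, edges: List[Tuple[int, int]], source: int) -> List[int]:
    """Return the BFS level of every vertex (-1 if unreachable).

    Hypothetical helper for illustration: one BFS iteration is one
    Boolean matrix-vector product with the transposed adjacency matrix.
    """
    # Build CSR for A^T: row v lists the in-neighbours u of v (edge u -> v).
    row_ptr = [0] * (n + 1)
    for _, v in edges:
        row_ptr[v + 1] += 1
    for i in range(n):
        row_ptr[i + 1] += row_ptr[i]
    col_idx = [0] * len(edges)
    fill = row_ptr[:]  # next free slot in each row
    for u, v in edges:
        col_idx[fill[v]] = u
        fill[v] += 1

    levels = [-1] * n
    levels[source] = 0
    frontier = [False] * n  # dense Boolean input vector x
    frontier[source] = True
    depth = 0
    while any(frontier):
        depth += 1
        nxt = [False] * n  # output vector y = A^T x, masked by "unvisited"
        for v in range(n):
            if levels[v] != -1:
                continue  # mask: visited rows are skipped entirely
            for k in range(row_ptr[v], row_ptr[v + 1]):
                if frontier[col_idx[k]]:
                    nxt[v] = True  # OR-reduction is settled by the first hit
                    break
        for v in range(n):
            if nxt[v]:
                levels[v] = depth
        frontier = nxt
    return levels
```

For example, on the directed graph with edges 0→1, 0→2, 1→3, 2→3, 3→4 and an isolated vertex 5, calling `bfs_levels(6, edges, 0)` assigns levels 0, 1, 1, 2, 3 to vertices 0 through 4 and leaves vertex 5 at -1.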

In this paper, we propose the ALgebraic Bypass BFS Algorithm (ALBBA), a novel and efficient algebraic BFS implementation optimized for long vector architectures. ALBBA utilizes a customized variant of the SELL-C-σ data structure to fully exploit the SIMD capabilities. By integrating a vectorization-friendly search method alongside a two-level bypass strategy, we enhance both sparse matrix-sparse vector multiplication (SpMSpV) and sparse matrix-dense vector multiplication (SpMV) algorithms, which are crucial for algebraic BFS operations. We further incorporate merge primitives and adopt an efficient selection method for each BFS iteration. Our experiments on an NEC VE20B processor demonstrate that ALBBA achieves average speedups of 3.91×, 2.88×, and 1.46× over Enterprise, GraphBLAST, and Gunrock running on an NVIDIA H100 GPU, respectively.
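The abstract does not describe the customized SELL-C-σ variant, so the sketch below shows only the standard SELL-C-σ layout as commonly defined: rows are sorted by descending length inside windows of σ rows, grouped into chunks of C rows, padded to the widest row of each chunk, and stored column-major so that one vector instruction can process one "column" of a chunk. The names `build_sell_c_sigma`, `sell_spmv_bool`, and the `PAD` sentinel are assumptions made for this example, and the matrix is pattern-only (Boolean), as suits BFS.

```python
from typing import List, Tuple

PAD = -1  # assumed sentinel marking padding slots


def build_sell_c_sigma(
    rows: List[List[int]], C: int, sigma: int
) -> Tuple[List[int], List[int], List[int], List[int]]:
    """rows[i] holds the column indices of the nonzeros in row i."""
    n = len(rows)
    # 1. Sort rows by descending length inside each window of sigma rows.
    perm: List[int] = []
    for start in range(0, n, sigma):
        window = list(range(start, min(start + sigma, n)))
        window.sort(key=lambda r: -len(rows[r]))  # stable sort keeps ties in order
        perm.extend(window)
    # 2. Cut the permuted rows into chunks of C and pad to the chunk width.
    chunk_ptr, chunk_len, cols = [0], [], []
    for start in range(0, n, C):
        chunk = perm[start:start + C]
        width = max(len(rows[r]) for r in chunk)
        chunk_len.append(width)
        # Column-major: the j-th nonzero of all C lanes is contiguous.
        for j in range(width):
            for lane in range(C):
                if lane < len(chunk) and j < len(rows[chunk[lane]]):
                    cols.append(rows[chunk[lane]][j])
                else:
                    cols.append(PAD)
        chunk_ptr.append(len(cols))
    return perm, chunk_ptr, chunk_len, cols


def sell_spmv_bool(perm, chunk_ptr, chunk_len, cols, C, x: List[bool]) -> List[bool]:
    """Boolean y = A x over the SELL-C-sigma arrays built above."""
    y = [False] * len(perm)
    for c, width in enumerate(chunk_len):
        base = chunk_ptr[c]
        acc = [False] * C  # one accumulator per vector lane
        for j in range(width):
            for lane in range(C):  # on vector hardware this loop is one instruction
                col = cols[base + j * C + lane]
                if col != PAD and x[col]:
                    acc[lane] = True
        for lane in range(C):
            slot = c * C + lane
            if slot < len(perm):
                y[perm[slot]] = acc[lane]  # undo the row permutation
    return y
```

The sorting window σ trades padding against locality: a larger σ groups rows of similar length and so reduces padding, while a smaller σ keeps rows closer to their original order. The chunk height C would typically be matched to the hardware vector length.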

