{"title":"Heterogeneous sparse matrix–vector multiplication via compressed sparse row format","authors":"Phillip Allen Lane, Joshua Dennis Booth","doi":"10.1016/j.parco.2023.102997","DOIUrl":null,"url":null,"abstract":"<div><p>Sparse matrix–vector multiplication (SpMV) is one of the most important kernels in high-performance computing (HPC), yet SpMV normally suffers from ill performance on many devices. Due to ill performance, SpMV normally requires special care to store and tune for a given device. Moreover, HPC is facing heterogeneous hardware containing multiple different compute units, e.g., many-core CPUs and GPUs. Therefore, an emerging goal has been to produce heterogeneous formats and methods that allow critical kernels, e.g., SpMV, to be executed on different devices with portable performance and minimal changes to format and method. This paper presents a heterogeneous format based on CSR, named CSR-<span><math><mi>k</mi></math></span>, that can be tuned quickly and outperforms the average performance of Intel MKL on Intel Xeon Platinum 838 and AMD Epyc 7742 CPUs while still outperforming NVIDIA’s cuSPARSE and Sandia National Laboratories’ KokkosKernels on NVIDIA A100 and V100 for regular sparse matrices, i.e., sparse matrices where the number of nonzeros per row has a variance <span><math><mo>≤</mo></math></span>10, such as those commonly generated from two and three-dimensional finite difference and element problems. In particular, CSR-<span><math><mi>k</mi></math></span> achieves this with reordering and by grouping rows into a hierarchical structure of super-rows and super–super-rows that are represented by just a few extra arrays of pointers. Due to its simplicity, a model can be tuned for a device, and this model can be used to select super-row and super–super-rows sizes in constant time.</p></div>","PeriodicalId":54642,"journal":{"name":"Parallel Computing","volume":"115 ","pages":"Article 102997"},"PeriodicalIF":2.0000,"publicationDate":"2023-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Parallel Computing","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0167819123000030","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, THEORY & METHODS","Score":null,"Total":0}
Citations: 0
Abstract
Sparse matrix–vector multiplication (SpMV) is one of the most important kernels in high-performance computing (HPC), yet SpMV typically performs poorly on many devices. As a result, SpMV normally requires special care to store and tune for a given device. Moreover, HPC is facing heterogeneous hardware containing multiple different compute units, e.g., many-core CPUs and GPUs. Therefore, an emerging goal has been to produce heterogeneous formats and methods that allow critical kernels, e.g., SpMV, to be executed on different devices with portable performance and minimal changes to format and method. This paper presents a heterogeneous format based on CSR, named CSR-k, that can be tuned quickly and outperforms the average performance of Intel MKL on Intel Xeon Platinum 838 and AMD Epyc 7742 CPUs while still outperforming NVIDIA's cuSPARSE and Sandia National Laboratories' KokkosKernels on NVIDIA A100 and V100 GPUs for regular sparse matrices, i.e., sparse matrices where the number of nonzeros per row has a variance ≤ 10, such as those commonly generated from two- and three-dimensional finite difference and finite element problems. In particular, CSR-k achieves this with reordering and by grouping rows into a hierarchical structure of super-rows and super–super-rows that are represented by just a few extra arrays of pointers. Due to its simplicity, a model can be tuned for a device, and this model can be used to select super-row and super–super-row sizes in constant time.
About the Journal
Parallel Computing is an international journal presenting the practical use of parallel computer systems, including high-performance architecture, system software, programming systems and tools, and applications. Within this context the journal covers all aspects of high-end parallel computing, from single homogeneous or heterogeneous computing nodes to large-scale multi-node systems.
Parallel Computing features original research work and review articles as well as novel or illustrative accounts of application experience with (and techniques for) the use of parallel computers. We also welcome studies reproducing prior publications that either confirm or disprove prior published results.
Particular technical areas of interest include, but are not limited to:
-System software for parallel computer systems including programming languages (new languages as well as compilation techniques), operating systems (including middleware), and resource management (scheduling and load-balancing).
-Enabling software including debuggers, performance tools, and system and numeric libraries.
-General hardware (architecture) concepts, new technologies enabling the realization of such new concepts, and details of commercially available systems.
-Software engineering and productivity as it relates to parallel computing.
-Applications (including scientific computing, deep learning, machine learning) or tool case studies demonstrating novel ways to achieve parallelism.
-Performance measurement results on state-of-the-art systems.
-Approaches to effectively utilize large-scale parallel computing including new algorithms or algorithm analysis with demonstrated relevance to real applications using existing or next generation parallel computer architectures.
-Parallel I/O systems, both hardware and software.
-Networking technology to support high-speed computing, demonstrating the impact of high-speed computation on parallel applications.