Enabling Efficient Sparse Multiplications on GPUs With Heuristic Adaptability
Jiaming Xu; Shan Huang; Jinhao Li; Guyue Huang; Yuan Xie; Yu Wang; Guohao Dai
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 44, no. 6, pp. 2226-2239
DOI: 10.1109/TCAD.2024.3518413
Published: 2024-12-16 | JCR: Q2 (Computer Science, Hardware & Architecture) | Impact Factor: 2.9
URL: https://ieeexplore.ieee.org/document/10802949/
Citations: 0
Abstract
Sparse matrix-vector/matrix multiplication, namely SpMMul, has become a fundamental operation in model inference across various domains. Previous studies have explored numerous optimizations to accelerate it. However, to enable efficient end-to-end inference, the following challenges remain unsolved: 1) incomplete design space and time-consuming preprocessing. Previous methods optimize SpMMul over a limited set of loops and neglect further exploration of the design space, wasting >30% of the available computing power. In addition, the preprocessing overhead in SparseTIR and DTC-SpMM is $1000\times$ larger than the sparse computation itself; 2) incompatibility between static dataflow and dynamic input. A static dataflow cannot be efficient for all inputs, leading to >80% performance loss; and 3) simplistic algorithm performance analysis. Previous studies primarily analyze performance in terms of algorithmic advantages, without considering other aspects such as hardware and data features. To tackle these challenges, we present DA-SpMMul, a Data-Aware heuristic GPU implementation of SpMMul for multiple platforms. DA-SpMMul proposes: 1) a complete design space based on theoretical computations, with nontrivial implementations that require no preprocessing. We propose three orthogonal design principles based on theoretical computations and provide nontrivial implementations on standard formats, eliminating complex preprocessing; 2) a feature-enabled adaptive algorithm selection mechanism. We design a heuristic model that selects algorithms based on various features; and 3) comprehensive algorithm performance analysis. We extract features from multiple perspectives and present a comprehensive performance analysis of all algorithms.
DA-SpMMul supports PyTorch on both NVIDIA and AMD GPUs, achieving average speedups of $3.33\times$ and $3.02\times$ over NVIDIA cuSPARSE and $12.05\times$ and $8.32\times$ over AMD rocSPARSE for sparse matrix-vector and sparse matrix-matrix multiplication, respectively, and up to $1.48\times$ speedup over the state-of-the-art open-source algorithm. Integrated with the graph neural network framework PyG, DA-SpMMul achieves up to $1.22\times$ speedup on GCN inference.
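To make the two core ideas in the abstract concrete, the sketch below shows (a) a reference CSR sparse matrix-vector product, the operation being accelerated, and (b) a toy data-aware dispatcher that picks a kernel from simple matrix features. This is purely illustrative: DA-SpMMul's actual heuristic model and kernels are not reproduced here, and the feature thresholds and kernel names in `pick_algorithm` are made up.

```python
import numpy as np

def dense_to_csr(A):
    """Convert a dense matrix to CSR arrays (indptr, indices, data)."""
    indptr, indices, data = [0], [], []
    for row in A:
        nz = np.nonzero(row)[0]
        indices.extend(nz)
        data.extend(row[nz])
        indptr.append(len(indices))
    return np.array(indptr), np.array(indices), np.array(data)

def spmv_csr(indptr, indices, data, x):
    """Reference CSR sparse matrix-vector product y = A @ x."""
    y = np.zeros(len(indptr) - 1)
    for r in range(len(indptr) - 1):
        s, e = indptr[r], indptr[r + 1]
        y[r] = data[s:e] @ x[indices[s:e]]
    return y

def pick_algorithm(indptr, n_cols):
    """Toy feature-based selection: density and row-length skew choose
    a kernel, mirroring the idea of feature-enabled adaptive algorithm
    selection. Kernel names and thresholds are hypothetical."""
    row_lens = np.diff(indptr)
    density = row_lens.sum() / (len(row_lens) * n_cols)
    if density > 0.25:
        return "dense-fallback"                  # too dense for sparse kernels
    if row_lens.max() > 8 * max(row_lens.mean(), 1):
        return "row-split"                       # a few very long rows: split them
    return "thread-per-row"                      # regular rows: one thread per row

rng = np.random.default_rng(0)
A = rng.random((32, 32)) * (rng.random((32, 32)) < 0.1)  # ~10% nonzeros
indptr, indices, data = dense_to_csr(A)
x = rng.random(32)
assert np.allclose(spmv_csr(indptr, indices, data, x), A @ x)
```

The point of the dispatcher is that no single static dataflow wins on every input: a matrix with a few very long rows benefits from splitting rows across threads, while a matrix with uniformly short rows is served well by a simple thread-per-row mapping.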
Journal Introduction:
The purpose of this Transactions is to publish papers of interest to individuals in the area of computer-aided design of integrated circuits and systems composed of analog, digital, mixed-signal, optical, or microwave components. The aids include methods, models, algorithms, and man-machine interfaces for system-level, physical and logical design including: planning, synthesis, partitioning, modeling, simulation, layout, verification, testing, hardware-software co-design and documentation of integrated circuit and system designs of all complexities. Design tools and techniques for evaluating and designing integrated circuits and systems for metrics such as performance, power, reliability, testability, and security are a focus.