Enabling Efficient Sparse Multiplications on GPUs With Heuristic Adaptability
Jiaming Xu; Shan Huang; Jinhao Li; Guyue Huang; Yuan Xie; Yu Wang; Guohao Dai
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 44, no. 6, pp. 2226-2239
DOI: 10.1109/TCAD.2024.3518413
Published: 2024-12-16 | JCR: Q2 (Computer Science, Hardware & Architecture) | Impact Factor: 2.9
URL: https://ieeexplore.ieee.org/document/10802949/
Citations: 0
Abstract
Sparse matrix-vector/matrix multiplication, namely SpMMul, has become a fundamental operation in model inference across various domains. Previous studies have explored numerous optimizations to accelerate it. However, to enable efficient end-to-end inference, the following challenges remain unsolved: 1) incomplete design space and time-consuming preprocessing. Previous methods optimize SpMMul over a limited set of loops and neglect further exploration of the design space, wasting >30% of the available computing power. In addition, the preprocessing overhead in SparseTIR and DTC-SpMM is $1000\times$ larger than the sparse computation itself; 2) incompatibility between static dataflow and dynamic input. A static dataflow cannot be efficient for all inputs, leading to >80% performance loss; and 3) simplistic algorithm performance analysis. Previous studies primarily analyze performance in terms of algorithmic advantages, without considering other aspects such as hardware and data features. To tackle these challenges, we present DA-SpMMul, a Data-Aware heuristic GPU implementation of SpMMul for multiple platforms. DA-SpMMul proposes: 1) a complete design space based on theoretical computations, with nontrivial implementations that require no preprocessing. We propose three orthogonal design principles based on theoretical computations and provide nontrivial implementations on standard formats, eliminating complex preprocessing; 2) a feature-enabled adaptive algorithm selection mechanism. We design a heuristic model that selects algorithms based on various features; and 3) comprehensive algorithm performance analysis. We extract features from multiple perspectives and present a comprehensive performance analysis of all algorithms.
DA-SpMMul supports PyTorch on both NVIDIA and AMD GPUs, achieving average speedups of $3.33\times$ and $3.02\times$ over NVIDIA cuSPARSE and $12.05\times$ and $8.32\times$ over AMD rocSPARSE for sparse matrix-vector and sparse matrix-matrix multiplication, respectively, and up to $1.48\times$ speedup over the state-of-the-art open-source algorithm. Integrated with the graph neural network framework PyG, DA-SpMMul achieves up to $1.22\times$ speedup on GCN inference.
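To make the two core ideas in the abstract concrete, the sketch below shows (a) a reference CSR sparse matrix-vector product, the operation being accelerated, and (b) a toy data-aware dispatcher that picks a kernel from simple matrix features. This is purely illustrative: DA-SpMMul's actual heuristic model and kernels are not reproduced here, and the feature thresholds and kernel names in `pick_algorithm` are made up.

```python
import numpy as np

def dense_to_csr(A):
    """Convert a dense matrix to CSR arrays (indptr, indices, data)."""
    indptr, indices, data = [0], [], []
    for row in A:
        nz = np.nonzero(row)[0]
        indices.extend(nz)
        data.extend(row[nz])
        indptr.append(len(indices))
    return np.array(indptr), np.array(indices), np.array(data)

def spmv_csr(indptr, indices, data, x):
    """Reference CSR sparse matrix-vector product y = A @ x."""
    y = np.zeros(len(indptr) - 1)
    for r in range(len(indptr) - 1):
        s, e = indptr[r], indptr[r + 1]
        y[r] = data[s:e] @ x[indices[s:e]]
    return y

def pick_algorithm(indptr, n_cols):
    """Toy feature-based selection: density and row-length skew choose
    a kernel, mirroring the idea of feature-enabled adaptive algorithm
    selection. Kernel names and thresholds are hypothetical."""
    row_lens = np.diff(indptr)
    density = row_lens.sum() / (len(row_lens) * n_cols)
    if density > 0.25:
        return "dense-fallback"                  # too dense for sparse kernels
    if row_lens.max() > 8 * max(row_lens.mean(), 1):
        return "row-split"                       # a few very long rows: split them
    return "thread-per-row"                      # regular rows: one thread per row

rng = np.random.default_rng(0)
A = rng.random((32, 32)) * (rng.random((32, 32)) < 0.1)  # ~10% nonzeros
indptr, indices, data = dense_to_csr(A)
x = rng.random(32)
assert np.allclose(spmv_csr(indptr, indices, data, x), A @ x)
```

The point of the dispatcher is that no single static dataflow wins on every input: a matrix with a few very long rows benefits from splitting rows across threads, while a matrix with uniformly short rows is served well by a simple thread-per-row mapping.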
Journal Introduction:
The purpose of this Transactions is to publish papers of interest to individuals in the area of computer-aided design of integrated circuits and systems composed of analog, digital, mixed-signal, optical, or microwave components. The aids include methods, models, algorithms, and man-machine interfaces for system-level, physical and logical design including: planning, synthesis, partitioning, modeling, simulation, layout, verification, testing, hardware-software co-design and documentation of integrated circuit and system designs of all complexities. Design tools and techniques for evaluating and designing integrated circuits and systems for metrics such as performance, power, reliability, testability, and security are a focus.