Vesper: A Versatile Sparse Linear Algebra Accelerator With Configurable Compute Patterns

Impact Factor: 2.7 | CAS Tier 3 (Computer Science) | JCR Q2 (Computer Science, Hardware & Architecture)
Hanchen Jin;Zichao Yue;Zhongyuan Zhao;Yixiao Du;Chenhui Deng;Nitish Srivastava;Zhiru Zhang
{"title":"Vesper: A Versatile Sparse Linear Algebra Accelerator With Configurable Compute Patterns","authors":"Hanchen Jin;Zichao Yue;Zhongyuan Zhao;Yixiao Du;Chenhui Deng;Nitish Srivastava;Zhiru Zhang","doi":"10.1109/TCAD.2024.3496882","DOIUrl":null,"url":null,"abstract":"Sparse linear algebra (SLA) operations are fundamental building blocks for many important applications, such as data analytics, graph processing, machine learning, and scientific computing. In particular, four compute kernels in SLA are widely used, including sparse-matrix-dense-vector multiplication, sparse-matrix-dense-matrix multiplication, sparse-matrix-sparse-vector multiplication, and sparse-matrix-sparse-matrix multiplication. Recently, an active area of research has emerged to build specialized hardware accelerators for these SLA kernels. However, existing efforts mostly focus on accelerating a single kernel and the proposed accelerator architectures are often limited to a specific compute pattern, such as inner, outer, or row-wise product. This work proposes Vesper, a high-performance and versatile sparse accelerator that supports all four important SLA kernels while being configurable to execute the compute patterns suitable for different kernels under various degrees of sparsity. To enable rapid exploration of the large architectural design and configuration space, we devise an analytical model to estimate the performance of an SLA kernel running on a given hardware configuration using a specific compute pattern. Guided by our model, we build a flexible yet efficient accelerator architecture that maximizes the resource sharing amongst the hardware modules used for different SLA kernels and the associated compute patterns. We evaluate the performance of Vesper using gem5 on a diverse set of matrices from SuiteSparse. Our experiment results show that Vesper achieves a comparable or higher throughput with increased bandwidth efficiency than the state-of-the-art accelerators that are tailor-made for a specific SLA kernel. In addition, we evaluate Vesper on a real-world application called label propagation (LP), an iterative graph-based learning algorithm that involves multiple SLA kernels and exhibits varying degrees of sparsity across iterations. Compared to CPU- and GPU-based executions, Vesper speeds up the LP algorithm by <inline-formula> <tex-math>$12.0\\times $ </tex-math></inline-formula> and <inline-formula> <tex-math>$1.7\\times $ </tex-math></inline-formula>, respectively.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"44 5","pages":"1731-1744"},"PeriodicalIF":2.7000,"publicationDate":"2024-11-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10752521/","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE","Score":null,"Total":0}
Citations: 0

Abstract

Sparse linear algebra (SLA) operations are fundamental building blocks for many important applications, such as data analytics, graph processing, machine learning, and scientific computing. In particular, four compute kernels in SLA are widely used: sparse-matrix-dense-vector multiplication, sparse-matrix-dense-matrix multiplication, sparse-matrix-sparse-vector multiplication, and sparse-matrix-sparse-matrix multiplication. Recently, an active area of research has emerged to build specialized hardware accelerators for these SLA kernels. However, existing efforts mostly focus on accelerating a single kernel, and the proposed accelerator architectures are often limited to a specific compute pattern, such as inner, outer, or row-wise product. This work proposes Vesper, a high-performance and versatile sparse accelerator that supports all four important SLA kernels while being configurable to execute the compute patterns suitable for different kernels under various degrees of sparsity. To enable rapid exploration of the large architectural design and configuration space, we devise an analytical model to estimate the performance of an SLA kernel running on a given hardware configuration using a specific compute pattern. Guided by our model, we build a flexible yet efficient accelerator architecture that maximizes resource sharing among the hardware modules used for different SLA kernels and the associated compute patterns. We evaluate the performance of Vesper using gem5 on a diverse set of matrices from SuiteSparse. Our experimental results show that Vesper achieves comparable or higher throughput, with better bandwidth efficiency, than state-of-the-art accelerators that are tailor-made for a specific SLA kernel. In addition, we evaluate Vesper on a real-world application, label propagation (LP), an iterative graph-based learning algorithm that involves multiple SLA kernels and exhibits varying degrees of sparsity across iterations. Compared to CPU- and GPU-based executions, Vesper speeds up the LP algorithm by $12.0\times$ and $1.7\times$, respectively.
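
For readers unfamiliar with the compute patterns the abstract refers to, the sketch below contrasts two of them, the row-wise (Gustavson) product and the outer product, on sparse-matrix-sparse-matrix multiplication. It is a minimal, generic illustration written in plain Python; the function names and the list-of-(index, value) representation are assumptions made for readability and do not reflect Vesper's architecture, data structures, or interfaces.

```python
# Generic sketch of two SpGEMM compute patterns (C = A x B), not Vesper's design.

def rowwise_spgemm(A_rows, B_rows):
    """Row-wise (Gustavson) product: stream the nonzeros of each row of A,
    gather the matching rows of B, and merge them in a per-row accumulator."""
    C_rows = []
    for a_row in A_rows:                        # a_row = [(k, A[i, k]), ...]
        acc = {}                                # sparse accumulator for one output row
        for k, a_val in a_row:
            for j, b_val in B_rows[k]:          # row k of B = [(j, B[k, j]), ...]
                acc[j] = acc.get(j, 0.0) + a_val * b_val
        C_rows.append(sorted(acc.items()))
    return C_rows

def outer_spgemm(A_cols, B_rows, num_rows):
    """Outer product: pair column k of A with row k of B to form rank-1 partial
    products, then merge all partial products into the output."""
    acc = [dict() for _ in range(num_rows)]     # one accumulator per output row
    for a_col, b_row in zip(A_cols, B_rows):    # a_col = [(i, A[i, k]), ...]
        for i, a_val in a_col:
            for j, b_val in b_row:
                acc[i][j] = acc[i].get(j, 0.0) + a_val * b_val
    return [sorted(row.items()) for row in acc]

if __name__ == "__main__":
    # A = [[1, 0], [2, 3]], B = [[0, 4], [5, 0]]  =>  C = [[0, 4], [15, 8]]
    A_rows = [[(0, 1.0)], [(0, 2.0), (1, 3.0)]]   # CSR-style rows of A
    A_cols = [[(0, 1.0), (1, 2.0)], [(1, 3.0)]]   # CSC-style columns of A
    B_rows = [[(1, 4.0)], [(0, 5.0)]]             # CSR-style rows of B
    assert rowwise_spgemm(A_rows, B_rows) == outer_spgemm(A_cols, B_rows, 2)
```

The inner-product pattern, not shown, instead intersects a row of A with a column of B for every output element. Each pattern trades off input reuse, accumulation cost, and memory traffic differently depending on operand sparsity, which is what motivates an accelerator that can be configured to use the pattern suited to each kernel and sparsity level.
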
Source journal: IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
CiteScore: 5.60
Self-citation rate: 13.80%
Publication volume: 500
Review time: 7 months
Journal description: The purpose of this Transactions is to publish papers of interest to individuals in the area of computer-aided design of integrated circuits and systems composed of analog, digital, mixed-signal, optical, or microwave components. The aids include methods, models, algorithms, and man-machine interfaces for system-level, physical and logical design including: planning, synthesis, partitioning, modeling, simulation, layout, verification, testing, hardware-software co-design and documentation of integrated circuit and system designs of all complexities. Design tools and techniques for evaluating and designing integrated circuits and systems for metrics such as performance, power, reliability, testability, and security are a focus.