Efficient Hardware Accelerator Based on Medium Granularity Dataflow for SpTRSV

IF 2.8 · CAS Zone 2 (Engineering & Technology) · JCR Q2, Computer Science, Hardware & Architecture

IEEE Transactions on Very Large Scale Integration (VLSI) Systems. Pub Date: 2024-11-20. DOI: 10.1109/TVLSI.2024.3497166

Qian Chen, Xiaofeng Yang, Shengli Lu

Volume 33, Issue 3, pp. 807-820. Full text: https://ieeexplore.ieee.org/document/10759529/

Citations: 0

Abstract



Sparse triangular solve (SpTRSV) is widely used in various domains. Numerous studies have been conducted using CPUs, GPUs, and specific hardware accelerators, where dataflows can be categorized into coarse and fine granularity. Coarse dataflows offer good spatial locality but suffer from low parallelism, while fine dataflows provide high parallelism but disrupt the spatial structure, leading to increased nodes and poor data reuse. This article proposes a novel hardware accelerator for SpTRSV or SpTRSV-like directed acyclic graphs (DAGs). The accelerator implements a medium granularity dataflow through hardware-software codesign and achieves both excellent spatial locality and high parallelism. In addition, a partial sum caching mechanism is introduced to reduce the blocking frequency of processing elements (PEs), and a reordering algorithm of intranode edges' computation is developed to enhance data reuse. Experimental results on 245 benchmarks with node counts reaching up to 85392 demonstrate that this work achieves average performance improvements of $7.0\times$ (up to $27.8\times$) over CPUs and $5.8\times$ (up to $98.8\times$) over GPUs. Compared with the state-of-the-art technique (DPU-v2), this work shows a $2.5\times$ (up to $5.9\times$) average performance improvement and a $1.7\times$ (up to $4.1\times$) average energy efficiency enhancement.
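For readers unfamiliar with the kernel, the following is a minimal software sketch of SpTRSV and of the dependency DAG the abstract refers to. It is a plain row-by-row reference in Python and does not reproduce the paper's medium granularity dataflow, partial sum caching, or edge reordering. The function names `sptrsv_lower_csr` and `dag_levels`, the CSR layout, and the assumption that each row stores its diagonal entry last are illustrative choices, not taken from the paper.

```python
from typing import List


def sptrsv_lower_csr(indptr: List[int], indices: List[int],
                     data: List[float], b: List[float]) -> List[float]:
    """Solve L x = b by forward substitution, L lower triangular in CSR.

    Assumption: column indices in each row are sorted ascending, so the
    diagonal entry is the last stored entry of the row.
    """
    n = len(b)
    x = [0.0] * n
    for i in range(n):
        start, end = indptr[i], indptr[i + 1]
        acc = b[i]
        # Every off-diagonal entry L[i][j] is a DAG edge j -> i:
        # x[i] cannot be finished before x[j] is known.
        for k in range(start, end - 1):
            acc -= data[k] * x[indices[k]]
        x[i] = acc / data[end - 1]
    return x


def dag_levels(indptr: List[int], indices: List[int]) -> List[int]:
    """Level of each row (node) in the dependency DAG.

    Rows on the same level have no dependencies on each other and could,
    in principle, be solved in parallel.
    """
    n = len(indptr) - 1
    level = [0] * n
    for i in range(n):
        for k in range(indptr[i], indptr[i + 1] - 1):
            level[i] = max(level[i], level[indices[k]] + 1)
    return level


# Example: L = [[2,0,0,0],[1,1,0,0],[0,0,4,0],[0,3,1,2]]
indptr = [0, 1, 3, 4, 7]
indices = [0, 0, 1, 2, 1, 2, 3]
data = [2.0, 1.0, 1.0, 4.0, 3.0, 1.0, 2.0]
b = [2.0, 3.0, 8.0, 12.0]

print(sptrsv_lower_csr(indptr, indices, data, b))  # [1.0, 2.0, 2.0, 2.0]
print(dag_levels(indptr, indices))                 # [0, 1, 0, 2]
```

In this small example rows 0 and 2 share level 0 and are independent, while row 3 must wait for rows 1 and 2. Treating a whole row as one node roughly corresponds to a coarse dataflow, and treating each multiply-accumulate edge as its own unit roughly corresponds to a fine one.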


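Source journal

IEEE Transactions on Very Large Scale Integration (VLSI) Systems (Engineering & Technology: Engineering, Electrical & Electronic)

CiteScore: 6.40

Self-citation rate: 7.10%

Articles published: 187

Review time: 3.6 months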

Journal description: The IEEE Transactions on VLSI Systems is published monthly under the co-sponsorship of the IEEE Circuits and Systems Society, the IEEE Computer Society, and the IEEE Solid-State Circuits Society. Design and realization of microelectronic systems using VLSI/ULSI technologies require close collaboration among scientists and engineers in the fields of systems architecture, logic and circuit design, chip and wafer fabrication, packaging, testing, and systems applications. Generation of specifications, design, and verification must be performed at all abstraction levels, including the system, register-transfer, logic, circuit, transistor, and process levels. To address this critical area through a common forum, the IEEE Transactions on VLSI Systems was founded. The editorial board, consisting of international experts, invites original papers that emphasize the novel systems integration aspects of microelectronic systems, including interactions among systems design and partitioning, logic and memory design, digital and analog circuit design, layout synthesis, CAD tools, chip and wafer fabrication, testing and packaging, and systems-level qualification. Thus, the coverage of these Transactions focuses on VLSI/ULSI microelectronic systems integration.