A Hybrid Systolic-Dataflow Architecture for Inductive Matrix Algorithms

2020 IEEE International Symposium on High Performance Computer Architecture (HPCA) | Pub Date: 2020-02-01 | DOI: 10.1109/HPCA47549.2020.00063

Jian Weng, Sihao Liu, Zhengrong Wang, Vidushi Dadu, Tony Nowatzki

Citations: 41

Abstract

Dense linear algebra kernels are critical for wireless, and the oncoming proliferation of 5G only amplifies their importance. Due to the inductive nature of many such algorithms, parallelism is difficult to exploit: parallel regions have fine-grain producer/consumer interaction with iteratively changing dependence distance, reuse rate, and memory access patterns. This makes multi-threading impractical due to fine-grain synchronization, and vectorization ineffective due to the non-rectangular iteration domain. CPUs, DSPs, and GPUs perform an order of magnitude below peak. Our insight is that if the nature of inductive dependences and memory accesses were explicit in the hardware/software interface, then a spatial architecture could efficiently execute parallel code regions. To this end, we first develop a novel execution model, inductive dataflow, where inductive dependence patterns and memory access patterns (streams) are first-order primitives. Second, we develop a hybrid spatial architecture combining systolic and tagged dataflow execution to attain high utilization at low energy and area cost. Finally, we create a scalable design through a novel vector-stream control model which amortizes control overhead both in time and spatially across architecture lanes. We evaluate our design, REVEL, with a full stack (compiler, ISA, simulator, RTL). Across a suite of linear algebra kernels, REVEL outperforms equally provisioned DSPs by 4.6× to 37×. Compared to state-of-the-art spatial architectures, REVEL is 3× faster on average. Compared to a set of ASICs, REVEL is only 2× the power and half the area.
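
As a rough illustration of what "inductive" means here, the sketch below uses forward substitution for a lower-triangular system. The abstract does not name this kernel, so treat it as an assumed representative, and the function name `forward_substitute` is hypothetical. Each outer iteration produces one scalar that a shrinking inner loop immediately consumes, which gives the fine-grain producer/consumer interaction and the non-rectangular (triangular) iteration domain the abstract describes.

```python
def forward_substitute(L, b):
    """Solve L x = b for a lower-triangular matrix L (list of rows).

    Illustrative only: a plain sequential loop nest, not REVEL's
    execution model or anything taken from the paper's code.
    """
    n = len(b)
    b = list(b)            # work on a copy so the caller's data is untouched
    x = [0.0] * n
    trip_counts = []       # inner-loop length at each outer iteration
    for j in range(n):
        # Producer: a single scalar division per outer iteration.
        x[j] = b[j] / L[j][j]
        # Consumer: a vector update that depends on x[j] and whose
        # length shrinks by one every outer iteration.
        for i in range(j + 1, n):
            b[i] -= L[i][j] * x[j]
        trip_counts.append(n - j - 1)
    return x, trip_counts


L = [[2.0, 0.0, 0.0],
     [1.0, 1.0, 0.0],
     [4.0, 2.0, 2.0]]
x, trips = forward_substitute(L, [2.0, 3.0, 12.0])
print(x)      # [1.0, 2.0, 2.0]
print(trips)  # [2, 1, 0]
```

The inner trip counts `[2, 1, 0]` trace out a triangle rather than a rectangle, so a fixed vector width is often only partly filled, and the scalar-to-vector dependence recurs on every outer iteration, so splitting the two regions across threads would require synchronizing at that granularity.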

