DaCO: A High-Performance Token Dataflow Coprocessor Overlay for FPGAs

2018 International Conference on Field-Programmable Technology (FPT) | Pub Date: 2018-12-01 | DOI: 10.1109/FPT.2018.00032

Siddhartha, Nachiket Kapre



Abstract

Dataflow computing architectures exploit dynamic parallelism at the fine granularity of individual operations and provide a pathway to overcome the performance and energy limits of conventional von Neumann models. In this vein, we present DaCO (Dataflow Coprocessor FPGA Overlay), a high-performance compute organization for FPGAs that delivers up to 2.5x speedup over existing dataflow alternatives. Historically, dataflow-style execution has been viewed as an attractive parallel computing paradigm because dataflow dependencies are implemented in a self-timed, decentralized manner and there are no sequential program counters. However, realising high-performance dataflow computers has remained elusive, largely due to the complexity of scheduling this parallelism and to data communication bottlenecks. DaCO addresses these challenges by (1) supporting large-scale (thousands of nodes) out-of-order scheduling using hierarchical lookup, (2) priority-aware routing of dataflow dependencies using the efficient Hoplite-Q NoC, and (3) clustering techniques that exploit data locality in the communication network organization. Each DaCO processing element is a programmable soft processor that communicates with the others over a packet-switching network-on-chip (PSNoC). We target the Arria 10 AX115S FPGA to take advantage of its hard floating-point DSP blocks, and maximize performance by multipumping the M20K Block RAMs. Overall, we can scale DaCO to 450 processors operating at an fmax of 250 MHz on the target platform. Each soft processor consumes 779 ALMs, 4 M20K BRAMs, and 3 hard floating-point DSP blocks for optimum balance, while the on-chip communication framework consumes less than 15% of the on-chip resources.
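To give a sense of scale, the per-processor figures quoted in the abstract can be multiplied out to the full 450-processor configuration. The sketch below is only back-of-the-envelope arithmetic on those quoted numbers; the constant and function names are illustrative and do not come from the paper, and the totals exclude the communication framework and any other logic.

```python
# Figures quoted in the abstract (names are illustrative, not from the paper).
PROCESSORS = 450      # soft processors in the largest reported configuration
ALMS_PER_PE = 779     # ALMs per soft processor
M20K_PER_PE = 4       # M20K block RAMs per soft processor
DSP_PER_PE = 3        # hard floating-point DSP blocks per soft processor


def overlay_totals(processors: int = PROCESSORS) -> dict:
    """Resources consumed by the soft processors alone (NoC not included)."""
    return {
        "alms": processors * ALMS_PER_PE,
        "m20k": processors * M20K_PER_PE,
        "dsp": processors * DSP_PER_PE,
    }


totals = overlay_totals()
print(totals)  # {'alms': 350550, 'm20k': 1800, 'dsp': 1350}
```

So the processing elements by themselves account for roughly 350 thousand ALMs, 1,800 M20K blocks, and 1,350 DSP blocks at the 450-processor design point.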

