{"title":"MLBlocks: FPGA Blocks for Machine Learning Applications","authors":"Seyedramin Rasoulinezhad, D. Boland, P. Leong","doi":"10.1145/3431920.3439479","DOIUrl":null,"url":null,"abstract":"The underlying goal of FPGA architecture research is to devise flexible substrates which implement a wide variety of circuits efficiently. Contemporary FPGA architectures have been optimized to support networking, signal processing and image processing applications through high precision digital signal processing (DSP) blocks. The recent emergence of machine learning has created a new set of demands characterized by: 1) higher computational density and 2) low precision arithmetic requirements. With the goal of exploring this new design space in a methodical manner, we first propose a problem formulation involving computing nested loops over multiply-accumulate (MAC) operations, which covers many basic linear algebra primitives and standard deep neural network (DNN) layers. A quantitative methodology for deriving efficient coarse-grained compute block architectures from benchmarks is then proposed together with a family of new compute units, called MLBlocks. These blocks are flexible mesh-based systolic array units parameterized with different data movements, data reuse, and multi-precision support. They utilize a columnar arrangement which is compatible with existing FPGA architectures. Finally, using synthetic benchmarks, we demonstrate that MLBlocks offer significantly improved performance over the commercial Xilinx DSP48E2, while maintaining similar area and timing requirements to current DSPs.","PeriodicalId":386071,"journal":{"name":"The 2021 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-02-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"The 2021 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3431920.3439479","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 2
Abstract
The underlying goal of FPGA architecture research is to devise flexible substrates that implement a wide variety of circuits efficiently. Contemporary FPGA architectures have been optimized to support networking, signal processing, and image processing applications through high-precision digital signal processing (DSP) blocks. The recent emergence of machine learning has created a new set of demands characterized by (1) higher computational density and (2) low-precision arithmetic requirements. With the goal of exploring this new design space in a methodical manner, we first propose a problem formulation involving computing nested loops over multiply-accumulate (MAC) operations, which covers many basic linear algebra primitives and standard deep neural network (DNN) layers. A quantitative methodology for deriving efficient coarse-grained compute block architectures from benchmarks is then proposed, together with a family of new compute units called MLBlocks. These blocks are flexible mesh-based systolic array units parameterized with different data movements, data reuse, and multi-precision support. They use a columnar arrangement that is compatible with existing FPGA architectures. Finally, using synthetic benchmarks, we demonstrate that MLBlocks offer significantly improved performance over the commercial Xilinx DSP48E2, while maintaining area and timing similar to current DSP blocks.
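To make the nested-loop MAC formulation concrete, the sketch below expresses a standard DNN convolution layer as six nested loops whose innermost statement is a single multiply-accumulate. This is an illustration of the general computational pattern only, not the paper's actual formulation or the MLBlocks hardware; the function name and array layouts are hypothetical. Matrix-vector and matrix-matrix products arise as special cases with fewer loops.

```python
# Illustrative sketch (hypothetical names): a 2-D convolution layer
# written as nested loops over MAC operations, the pattern the
# paper's problem formulation is built around.

def conv2d_mac(inp, weights):
    """Valid-padding 2-D convolution as nested MAC loops.

    inp:     [C_in][H][W]          input feature maps
    weights: [C_out][C_in][K][K]   filter kernels
    returns: [C_out][H-K+1][W-K+1] output feature maps
    """
    c_out = len(weights)
    c_in, k = len(weights[0]), len(weights[0][0])
    h, w = len(inp[0]), len(inp[0][0])
    out = [[[0.0] * (w - k + 1) for _ in range(h - k + 1)]
           for _ in range(c_out)]
    # Six nested loops; the innermost statement is one MAC.
    for co in range(c_out):                   # output channels
        for y in range(h - k + 1):            # output rows
            for x in range(w - k + 1):        # output columns
                for ci in range(c_in):        # input channels
                    for ky in range(k):       # kernel rows
                        for kx in range(k):   # kernel columns
                            out[co][y][x] += (weights[co][ci][ky][kx]
                                              * inp[ci][y + ky][x + kx])
    return out
```

Different orderings and tilings of these loops trade off which operands are moved and which are reused in place, which is the data-movement and data-reuse design space that the MLBlocks parameterization targets.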