Tensor Slices: FPGA Building Blocks For The Deep Learning Era

Aman Arora, Moinak Ghosh, Samidh Mehta, Vaughn Betz, L. John
{"title":"Tensor Slices: FPGA Building Blocks For The Deep Learning Era","authors":"Aman Arora, Moinak Ghosh, Samidh Mehta, Vaughn Betz, L. John","doi":"10.1145/3529650","DOIUrl":null,"url":null,"abstract":"FPGAs are well-suited for accelerating deep learning (DL) applications owing to the rapidly changing algorithms, network architectures and computation requirements in this field. However, the generic building blocks available on traditional FPGAs limit the acceleration that can be achieved. Many modifications to FPGA architecture have been proposed and deployed including adding specialized artificial intelligence (AI) processing engines, adding support for smaller precision math like 8-bit fixed point and IEEE half-precision (fp16) in DSP slices, adding shadow multipliers in logic blocks, etc. In this paper, we describe replacing a portion of the FPGA’s programmable logic area with Tensor Slices. These slices have a systolic array of processing elements at their heart that support multiple tensor operations, multiple dynamically-selectable precisions and can be dynamically fractured into individual multipliers and MACs (multiply-and-accumulate). These slices have a local crossbar at the inputs that helps with easing the routing pressure caused by a large block on the FPGA. Adding these DL-specific coarse-grained hard blocks to FPGAs increases their compute density and makes them even better hardware accelerators for DL applications, while still keeping the vast majority of the real estate on the FPGA programmable at fine-grain.","PeriodicalId":162787,"journal":{"name":"ACM Transactions on Reconfigurable Technology and Systems (TRETS)","volume":"62 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-04-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"4","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM Transactions on Reconfigurable Technology and Systems (TRETS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3529650","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 4

Abstract

FPGAs are well-suited for accelerating deep learning (DL) applications owing to the rapidly changing algorithms, network architectures and computation requirements in this field. However, the generic building blocks available on traditional FPGAs limit the acceleration that can be achieved. Many modifications to FPGA architecture have been proposed and deployed including adding specialized artificial intelligence (AI) processing engines, adding support for smaller precision math like 8-bit fixed point and IEEE half-precision (fp16) in DSP slices, adding shadow multipliers in logic blocks, etc. In this paper, we describe replacing a portion of the FPGA’s programmable logic area with Tensor Slices. These slices have a systolic array of processing elements at their heart that support multiple tensor operations, multiple dynamically-selectable precisions and can be dynamically fractured into individual multipliers and MACs (multiply-and-accumulate). These slices have a local crossbar at the inputs that helps with easing the routing pressure caused by a large block on the FPGA. Adding these DL-specific coarse-grained hard blocks to FPGAs increases their compute density and makes them even better hardware accelerators for DL applications, while still keeping the vast majority of the real estate on the FPGA programmable at fine-grain.
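To make the abstract's terminology concrete, below is a minimal behavioral sketch of the multiply-and-accumulate (MAC) operation performed by each processing element in a systolic array computing a matrix product. This is an illustrative assumption of an output-stationary dataflow written in Python for clarity; it is not the paper's RTL, and the function name, dimensions, and dataflow are hypothetical, not taken from the Tensor Slice design.

```python
# Behavioral sketch (not the paper's hardware) of an output-stationary
# systolic array computing C = A x B. Each processing element (i, j)
# holds one accumulator and repeatedly performs acc += a * b as operands
# stream past it, which is the MAC operation the abstract refers to.

def systolic_matmul(A, B):
    """Simulate an M x N grid of MAC processing elements for C = A @ B.

    A is M x K, B is K x N; each PE (i, j) accumulates C[i][j].
    """
    M, K = len(A), len(A[0])
    N = len(B[0])
    acc = [[0 for _ in range(N)] for _ in range(M)]  # one accumulator per PE

    # In hardware, rows of A and columns of B are skewed and shifted through
    # the array one step per cycle; behaviorally that reduces to this loop.
    for k in range(K):                 # one wavefront of operands per step
        for i in range(M):
            for j in range(N):
                acc[i][j] += A[i][k] * B[k][j]   # the per-PE MAC
    return acc

if __name__ == "__main__":
    A = [[1, 2], [3, 4]]
    B = [[5, 6], [7, 8]]
    print(systolic_matmul(A, B))   # [[19, 22], [43, 50]]
```

In the actual slice, the same PE grid supports multiple dynamically selectable precisions (e.g., 8-bit fixed point, fp16) and can be fractured into standalone multipliers and MACs; the sketch above only captures the accumulation pattern.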