{"title":"Tensor Slices to the Rescue: Supercharging ML Acceleration on FPGAs","authors":"Aman Arora, Samidh Mehta, Vaughn Betz, L. John","doi":"10.1145/3431920.3439282","DOIUrl":null,"url":null,"abstract":"FPGAs are well-suited for accelerating deep learning (DL) applications owing to the rapidly changing algorithms, network architectures and computation requirements in this field. However, the generic building blocks available on traditional FPGAs limit the acceleration that can be achieved. Many modifications to FPGA architecture have been proposed and deployed including adding specialized artificial intelligence (AI) processing engines, adding support for IEEE half-precision (fp16) math in DSP slices, adding hard matrix multiplier blocks, etc. In this paper, we describe replacing a small percentage of the FPGA's programmable logic area with Tensor Slices. These slices are arrays of processing elements at their heart that support multiple tensor operations, multiple dynamically-selectable precisions and can be dynamically fractured into individual adders, multipliers and MACs (multiply-and-accumulate). These tiles have a local crossbar at the inputs that helps with easing the routing pressure caused by a large slice. By spending ~3% of FPGA's area on Tensor Slices, we observe an average frequency increase of 2.45x and average area reduction by 0.41x across several ML benchmarks, including a TPU-like design, compared to an Intel Agilex-like baseline FPGA. We also study the impact of spending area on Tensor slices on non-ML applications. We observe an average reduction of 1% in frequency and an average increase of 1% in routing wirelength compared to the baseline, across the non-ML benchmarks we studied. 
Adding these ML-specific coarse-grained hard blocks makes the proposed FPGA a much efficient hardware accelerator for ML applications, while still keeping the vast majority of the real estate on the FPGA programmable at fine-grain.","PeriodicalId":386071,"journal":{"name":"The 2021 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-02-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"15","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"The 2021 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3431920.3439282","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 15
Abstract
FPGAs are well-suited for accelerating deep learning (DL) applications owing to the rapidly changing algorithms, network architectures, and computation requirements in this field. However, the generic building blocks available on traditional FPGAs limit the acceleration that can be achieved. Many modifications to FPGA architecture have been proposed and deployed, including adding specialized artificial intelligence (AI) processing engines, adding support for IEEE half-precision (fp16) math in DSP slices, and adding hard matrix multiplier blocks. In this paper, we describe replacing a small percentage of the FPGA's programmable logic area with Tensor Slices. These slices have arrays of processing elements at their heart that support multiple tensor operations and multiple dynamically-selectable precisions, and can be dynamically fractured into individual adders, multipliers, and MACs (multiply-and-accumulate units). These tiles have a local crossbar at the inputs that helps ease the routing pressure caused by a large slice. By spending ~3% of the FPGA's area on Tensor Slices, we observe an average frequency increase of 2.45x and an average area reduction to 0.41x across several ML benchmarks, including a TPU-like design, compared to an Intel Agilex-like baseline FPGA. We also study the impact of spending area on Tensor Slices on non-ML applications. We observe an average reduction of 1% in frequency and an average increase of 1% in routing wirelength compared to the baseline across the non-ML benchmarks we studied. Adding these ML-specific coarse-grained hard blocks makes the proposed FPGA a much more efficient hardware accelerator for ML applications, while still keeping the vast majority of the FPGA's real estate programmable at fine grain.
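To make the "fracturable" processing element idea concrete, here is a minimal behavioral sketch in Python. It is not the paper's implementation (the Tensor Slice is hard silicon, and the real block also handles multiple precisions and tensor-level operations); the class and mode names are hypothetical, chosen only to illustrate one PE that can act as a plain adder, a plain multiplier, or a full MAC depending on a dynamically selected mode.

```python
# Hypothetical behavioral model of one processing element (PE) inside a
# Tensor Slice. Mode names ("add", "mult", "mac") are illustrative, not
# taken from the paper; precision selection is omitted for brevity.

class ProcessingElement:
    def __init__(self):
        self.acc = 0  # local accumulator, used only in MAC mode

    def step(self, a, b, mode="mac"):
        if mode == "add":       # fractured into a standalone adder
            return a + b
        if mode == "mult":      # fractured into a standalone multiplier
            return a * b
        if mode == "mac":       # full multiply-and-accumulate
            self.acc += a * b
            return self.acc
        raise ValueError(f"unknown mode: {mode}")


# A Tensor Slice is conceptually a 2-D array of such PEs; a dot product
# maps onto one PE stepped once per element pair in MAC mode.
pe = ProcessingElement()
for a, b in zip([1, 2, 3], [4, 5, 6]):
    out = pe.step(a, b, mode="mac")
print(out)  # 1*4 + 2*5 + 3*6 = 32
```

In the actual architecture this mode selection is a runtime control input rather than a configuration-time choice, which is what lets the same silicon serve ML matrix math or fall back to generic arithmetic.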