Parallel Accurate Minifloat MACCs for Neural Network Inference on Versal FPGAs

IF 2.7; Zone 3, Computer Science; Q2, COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems Pub Date : 2024-12-04 DOI:10.1109/TCAD.2024.3511343

Hans Jakob Damsgaard;Konstantin J. Hoßfeld;Jari Nurmi;Thomas B. Preußer

Volume 44, Issue 6, pages 2181-2194. Full text: https://ieeexplore.ieee.org/document/10777058/

Citations: 0

Abstract

Machine learning (ML) is ubiquitous in contemporary applications. Its need for efficient acceleration has driven vast research efforts into the quantization of neural networks with low-precision numerical formats. Models quantized with minifloat formats of eight or fewer bits have proven capable of outperforming models quantized into same-size integers. However, unlike integers, minifloats require accurate accumulation to prevent the introduction of rounding errors. We explore the design space of parallel accurate minifloat multiply-accumulators (MACCs) targeting the AMD Versal™ FPGA fabric. We experiment with three variations of the multiply-and-shift and adder tree components of a minifloat MACC. For comparison, we apply similar alterations to a parallel integer MACC. Our results show that custom compressor trees with external sign-inversion gates reduce the mean area of the minifloat MACCs by 17.7% and increase their clock frequency by 16.2%. In comparison, custom compressor trees with absorbed partial product generation gates reduce the mean area of integer MACCs by 28.1% and increase their clock frequency by 3.60%. Comparing the best-performing designs, we observe that minifloat MACCs consume 20% to 180% more resources than integer ones with same-size operands without accounting for a conversion back into a floating-point format, and 60% to 300% more resources when including it. Our data enable engineers to make informed decisions in their designs of deeply integrated embedded ML solutions when trading off training and fine-tuning effort versus resource cost.
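To make the idea of accurate accumulation concrete, the following Python sketch models a multiply-and-shift step followed by an exact integer sum. It is an illustration only, not the authors' hardware design: the function names (`decode`, `to_fraction`, `exact_macc`), the 1-sign/`e_bits`/`m_bits` layout with bias 2^(e_bits-1)-1, the subnormal handling, and the treatment of every exponent code as a finite value (no infinities or NaNs) are all assumptions made for this example.

```python
from fractions import Fraction


def decode(bits, e_bits, m_bits):
    """Split a minifloat bit pattern into (sign, effective exponent, integer significand).

    Assumed layout (hypothetical, for illustration): [sign | exponent | mantissa],
    subnormals when the exponent field is zero, no infinities or NaNs.
    """
    m = bits & ((1 << m_bits) - 1)
    e = (bits >> m_bits) & ((1 << e_bits) - 1)
    s = (bits >> (e_bits + m_bits)) & 1
    hidden = 1 if e > 0 else 0          # implicit leading one for normal numbers
    return s, max(e, 1), (hidden << m_bits) | m


def to_fraction(bits, e_bits, m_bits):
    """Exact rational value of a minifloat, used here only as a reference model."""
    s, eff, sig = decode(bits, e_bits, m_bits)
    bias = (1 << (e_bits - 1)) - 1
    value = Fraction(sig) * Fraction(2) ** (eff - bias - m_bits)
    return -value if s else value


def exact_macc(a_vec, b_vec, e_bits, m_bits):
    """Accumulate sum(a_i * b_i) without rounding.

    Each product of integer significands is shifted left by the sum of the
    effective exponents (offset by the minimum sum, 2), so every term lands on
    one common fixed-point grid and can be added as a plain integer.
    Returns (acc, scale_exp) with the exact result equal to acc * 2**scale_exp.
    """
    acc = 0
    for a, b in zip(a_vec, b_vec):
        sa, ea, ma = decode(a, e_bits, m_bits)
        sb, eb, mb = decode(b, e_bits, m_bits)
        term = (ma * mb) << (ea + eb - 2)   # multiply-and-shift
        acc += -term if sa ^ sb else term   # sign applied before accumulation
    bias = (1 << (e_bits - 1)) - 1
    scale_exp = 2 - 2 * bias - 2 * m_bits   # weight of the accumulator's LSB
    return acc, scale_exp


if __name__ == "__main__":
    # 4 exponent bits, 3 mantissa bits: 0x38 = 1.0, 0x40 = 2.0, 0xB8 = -1.0
    acc, scale_exp = exact_macc([0x38, 0x40], [0x40, 0xB8], 4, 3)
    print(acc * Fraction(2) ** scale_exp)   # 1.0*2.0 + 2.0*(-1.0)
```

Under these assumptions, a format with 4 exponent bits and 3 mantissa bits yields significand products of at most 225 shifted by at most 28 positions, so each term needs 36 magnitude bits before sign and accumulation growth are considered. This width is what an accurate accumulator trades for freedom from rounding error.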




Source journal

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (Engineering & Technology: Engineering, Electrical & Electronic)

CiteScore

5.60

Self-citation rate

13.80%

Articles published

500

Review time

7 months

About the journal: The purpose of this Transactions is to publish papers of interest to individuals in the area of computer-aided design of integrated circuits and systems composed of analog, digital, mixed-signal, optical, or microwave components. The aids include methods, models, algorithms, and man-machine interfaces for system-level, physical and logical design including: planning, synthesis, partitioning, modeling, simulation, layout, verification, testing, hardware-software co-design and documentation of integrated circuit and system designs of all complexities. Design tools and techniques for evaluating and designing integrated circuits and systems for metrics such as performance, power, reliability, testability, and security are a focus.