Design of a Low Power Bfloat16 Pipelined MAC Unit for Deep Neural Network Applications

Ankita Tiwari, G. Trivedi, P. Guha
{"title":"Design of a Low Power Bfloat16 Pipelined MAC Unit for Deep Neural Network Applications","authors":"Ankita Tiwari, G. Trivedi, P. Guha","doi":"10.1109/TENSYMP52854.2021.9550912","DOIUrl":null,"url":null,"abstract":"Evolution of artificial intelligence (AI) and advances in semiconductor technology has enabled us to design many complex systems ranging from IoT based applications to high performance compute engines. AI incorporates various application driven machine learning algorithms, in which floating point numbers are employed for the training of neural network models. However, few simpler number systems, such as fixed-point and integers, are employed in inference due to their smaller bit-width, which reduce area and power consumption at the cost of accuracy due to quantization. The usage of floating point MAC improves the accuracy, but it results in a larger area and more power consumption. In this paper, an area and power efficient pipelined Bfloat16 MAC is proposed aiming performance improvement of neural network applications. The proposed unit is able to handle overflow, underflow, and normalization efficiently. Additionally, computational accuracy of MAC is improved by increasing mantissa bit-width and by eliminating normalization in the intermediate stages. The proposed non-pipelined MAC utilizes 18.61% less resources as compared to similar architectures. The area and power of the proposed 16-bit nonpipelined Bfloat16 MAC is reduced by 5.21% and 32%, respectively, at 200 MHz as compared to another 16-bit nonpipelined Bfloat16 MAC reported in [26]. The area and power of our proposed MAC is improved by 38.6% and 93% at 200 MHz, and 7.1% and 11.52% at 01 GHz, when it is compared with a 16-bit pipelined posit MAC and a pipelined Bfloat16 MAC reported in [27], respectively.","PeriodicalId":137485,"journal":{"name":"2021 IEEE Region 10 Symposium (TENSYMP)","volume":"59 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-08-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"5","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 IEEE Region 10 Symposium (TENSYMP)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/TENSYMP52854.2021.9550912","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 5

Abstract

The evolution of artificial intelligence (AI) and advances in semiconductor technology have enabled the design of many complex systems, ranging from IoT-based applications to high-performance compute engines. AI incorporates various application-driven machine learning algorithms, in which floating-point numbers are employed for training neural network models. However, simpler number systems, such as fixed-point and integer formats, are often employed in inference because their smaller bit-widths reduce area and power consumption, at the cost of accuracy lost to quantization. Using a floating-point MAC improves accuracy, but it results in larger area and higher power consumption. In this paper, an area- and power-efficient pipelined Bfloat16 MAC is proposed, aimed at improving the performance of neural network applications. The proposed unit handles overflow, underflow, and normalization efficiently. Additionally, the computational accuracy of the MAC is improved by increasing the mantissa bit-width and by eliminating normalization in the intermediate stages. The proposed non-pipelined MAC utilizes 18.61% fewer resources than similar architectures. The area and power of the proposed 16-bit non-pipelined Bfloat16 MAC are reduced by 5.21% and 32%, respectively, at 200 MHz compared to another 16-bit non-pipelined Bfloat16 MAC reported in [26]. The area and power of the proposed MAC are improved by 38.6% and 93% at 200 MHz, and by 7.1% and 11.52% at 1 GHz, when compared with a 16-bit pipelined posit MAC and a pipelined Bfloat16 MAC reported in [27], respectively.
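To make the quantization trade-off the abstract describes concrete, the following is a minimal software sketch of a bfloat16 multiply-accumulate step, emulated in Python by truncating float32 bit patterns to bfloat16's 1 sign, 8 exponent, and 7 mantissa bits. This is an illustration of the number format only, not the authors' hardware design; the helper names to_bfloat16 and bfloat16_mac are hypothetical.

import struct

def to_bfloat16(x: float) -> float:
    """Reduce a float to bfloat16 precision by keeping only the top
    16 bits of its float32 encoding (1 sign, 8 exponent, 7 mantissa
    bits). Truncation is used for simplicity; hardware may round."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

def bfloat16_mac(acc: float, a: float, b: float) -> float:
    """One multiply-accumulate step, acc + a * b, with every operand
    and intermediate result quantized to bfloat16. A design that keeps
    a wider internal mantissa and skips intermediate normalization, as
    the paper proposes, loses less accuracy than this naive version."""
    product = to_bfloat16(to_bfloat16(a) * to_bfloat16(b))
    return to_bfloat16(acc + product)

# Example: accumulate a small dot product entirely in bfloat16.
acc = 0.0
for a, b in [(0.1, 0.2), (0.3, 0.4), (0.5, 0.6)]:
    acc = bfloat16_mac(acc, a, b)
print(acc)  # close to, but not exactly, 0.44 due to the 7-bit mantissa

Re-quantizing after every operation, as above, is where accuracy is lost; the paper's approach of widening the intermediate mantissa and eliminating normalization in intermediate stages recovers part of that loss while keeping the 16-bit storage format.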