7.7 LNPU: 25.3TFLOPS/W的稀疏深度神经网络学习处理器，具有FP8-FP16的细粒度混合精度

2019 IEEE International Solid- State Circuits Conference - (ISSCC) Pub Date : 2019-02-01 DOI:10.1109/ISSCC.2019.8662302

Jinsu Lee, Juhyoung Lee, Donghyeon Han, Jinmook Lee, Gwangtae Park, H. Yoo

{"title":"7.7 LNPU: 25.3TFLOPS/W的稀疏深度神经网络学习处理器，具有FP8-FP16的细粒度混合精度","authors":"Jinsu Lee, Juhyoung Lee, Donghyeon Han, Jinmook Lee, Gwangtae Park, H. Yoo","doi":"10.1109/ISSCC.2019.8662302","DOIUrl":null,"url":null,"abstract":"Recently, deep neural network (DNN) hardware accelerators have been reported for energy-efficient deep learning (DL) acceleration [1–6]. Most prior DNN inference accelerators are trained in the cloud using public datasets; parameters are then downloaded to implement AI [1–5]. However, local DNN learning with domain-specific and private data is required meet various user preferences on edge or mobile devices. Since edge and mobile devices contain only limited computation capability with battery power, an energy-efficient DNN learning processor is necessary. Only [6] supported on-chip DNN learning, but it was not energy-efficient, as it did not utilize sparsity which represents 37%-61% of the inputs for various CNNs, such as VGG16, AlexNet and ResNet-18, as shown in Fig. 7.7.1. Although [3–5] utilized the sparsity, they only considered the inference phase with inter-channel accumulation in Fig. 7.7.1, and did not support intra-channel accumulation for the weight-gradient generation (WG) step of the learning phase. Also, [6] adopted FP16, but it was not energy optimal because FP8 is enough for many input operands with 4× less energy than FP16.","PeriodicalId":265551,"journal":{"name":"2019 IEEE International Solid- State Circuits Conference - (ISSCC)","volume":"298 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"96","resultStr":"{\"title\":\"7.7 LNPU: A 25.3TFLOPS/W Sparse Deep-Neural-Network Learning Processor with Fine-Grained Mixed Precision of FP8-FP16\",\"authors\":\"Jinsu Lee, Juhyoung Lee, Donghyeon Han, Jinmook Lee, Gwangtae Park, H. Yoo\",\"doi\":\"10.1109/ISSCC.2019.8662302\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Recently, deep neural network (DNN) hardware accelerators have been reported for energy-efficient deep learning (DL) acceleration [1–6]. Most prior DNN inference accelerators are trained in the cloud using public datasets; parameters are then downloaded to implement AI [1–5]. However, local DNN learning with domain-specific and private data is required meet various user preferences on edge or mobile devices. Since edge and mobile devices contain only limited computation capability with battery power, an energy-efficient DNN learning processor is necessary. Only [6] supported on-chip DNN learning, but it was not energy-efficient, as it did not utilize sparsity which represents 37%-61% of the inputs for various CNNs, such as VGG16, AlexNet and ResNet-18, as shown in Fig. 7.7.1. Although [3–5] utilized the sparsity, they only considered the inference phase with inter-channel accumulation in Fig. 7.7.1, and did not support intra-channel accumulation for the weight-gradient generation (WG) step of the learning phase. Also, [6] adopted FP16, but it was not energy optimal because FP8 is enough for many input operands with 4× less energy than FP16.\",\"PeriodicalId\":265551,\"journal\":{\"name\":\"2019 IEEE International Solid- State Circuits Conference - (ISSCC)\",\"volume\":\"298 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2019-02-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"96\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2019 IEEE International Solid- State Circuits Conference - (ISSCC)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ISSCC.2019.8662302\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2019 IEEE International Solid- State Circuits Conference - (ISSCC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ISSCC.2019.8662302","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 96

摘要

最近，深度神经网络(DNN)硬件加速器被报道用于节能深度学习(DL)加速[1-6]。大多数先前的DNN推理加速器是使用公共数据集在云中训练的;然后下载参数来实现AI[1-5]。然而，需要使用特定领域和私有数据进行局部深度神经网络学习，以满足边缘或移动设备上的各种用户偏好。由于边缘设备和移动设备只有有限的电池计算能力，因此需要一个节能的深度神经网络学习处理器。只有[6]支持片上DNN学习，但它并不节能，因为它没有利用稀疏性，稀疏性占各种cnn(如VGG16, AlexNet和ResNet-18)输入的37%-61%，如图7.7.1所示。虽然[3-5]利用了稀疏性，但他们只考虑了图7.7.1中具有通道间积累的推理阶段，而不支持学习阶段的权重梯度生成(WG)步骤的通道内积累。[6]也采用了FP16，但它不是能量最优的，因为FP8可以满足许多输入操作数，而能量比FP16少4倍。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

7.7 LNPU: A 25.3TFLOPS/W Sparse Deep-Neural-Network Learning Processor with Fine-Grained Mixed Precision of FP8-FP16

Recently, deep neural network (DNN) hardware accelerators have been reported for energy-efficient deep learning (DL) acceleration [1–6]. Most prior DNN inference accelerators are trained in the cloud using public datasets; parameters are then downloaded to implement AI [1–5]. However, local DNN learning with domain-specific and private data is required meet various user preferences on edge or mobile devices. Since edge and mobile devices contain only limited computation capability with battery power, an energy-efficient DNN learning processor is necessary. Only [6] supported on-chip DNN learning, but it was not energy-efficient, as it did not utilize sparsity which represents 37%-61% of the inputs for various CNNs, such as VGG16, AlexNet and ResNet-18, as shown in Fig. 7.7.1. Although [3–5] utilized the sparsity, they only considered the inference phase with inter-channel accumulation in Fig. 7.7.1, and did not support intra-channel accumulation for the weight-gradient generation (WG) step of the learning phase. Also, [6] adopted FP16, but it was not energy optimal because FP8 is enough for many input operands with 4× less energy than FP16.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2019 IEEE International Solid- State Circuits Conference - (ISSCC)

自引率

0.00%

发文量