Design Exploration of IoT centric Neural Inference Accelerators

V. Parmar, M. Suri
{"title":"以物联网为中心的神经推理加速器设计探索","authors":"V. Parmar, M. Suri","doi":"10.1145/3194554.3194614","DOIUrl":null,"url":null,"abstract":"Neural networks have been successfully deployed in a variety of fields like computer vision, natural language processing, pattern recognition, etc. However most of their current deployments are suitable for cloud-based high-performance computing systems. As the computation of neural networks is not suited to traditional Von-Neumann CPU architectures, many novel hardware accelerator designs have been proposed in literature. In this paper we present the design of a novel, simplified and extensible neural inference engine for IoT systems. We present a detailed analysis on the impact of various design choices like technology node, computation block size, etc on overall performance of the neural inference engine. The paper demonstrates the first design instance of a power-optimized ELM neural network using ReLU activation. Comparison between learning performance of simulated hardware against the software model of the neural network shows a variation of ~ 1% in testing accuracy due to quantization. The accelerator compute blocks manage to achieve a performance per Watt of ~ 290 MSPS/W (Million samples per second per Watt) with a network structure of size: 8 x 32 x 2. Minimum energy of 40 pJ is acheived per sample processed for a block size of 16. Further, we show through simulations that an added power-saving of ~ 30 % can be acheived if SRAM based main memory is replaced with emerging STT-MRAM technology.","PeriodicalId":215940,"journal":{"name":"Proceedings of the 2018 on Great Lakes Symposium on VLSI","volume":"18 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-05-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"6","resultStr":"{\"title\":\"Design Exploration of IoT centric Neural Inference Accelerators\",\"authors\":\"V. Parmar, M. Suri\",\"doi\":\"10.1145/3194554.3194614\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Neural networks have been successfully deployed in a variety of fields like computer vision, natural language processing, pattern recognition, etc. However most of their current deployments are suitable for cloud-based high-performance computing systems. As the computation of neural networks is not suited to traditional Von-Neumann CPU architectures, many novel hardware accelerator designs have been proposed in literature. In this paper we present the design of a novel, simplified and extensible neural inference engine for IoT systems. We present a detailed analysis on the impact of various design choices like technology node, computation block size, etc on overall performance of the neural inference engine. The paper demonstrates the first design instance of a power-optimized ELM neural network using ReLU activation. Comparison between learning performance of simulated hardware against the software model of the neural network shows a variation of ~ 1% in testing accuracy due to quantization. The accelerator compute blocks manage to achieve a performance per Watt of ~ 290 MSPS/W (Million samples per second per Watt) with a network structure of size: 8 x 32 x 2. Minimum energy of 40 pJ is acheived per sample processed for a block size of 16. 
Further, we show through simulations that an added power-saving of ~ 30 % can be acheived if SRAM based main memory is replaced with emerging STT-MRAM technology.\",\"PeriodicalId\":215940,\"journal\":{\"name\":\"Proceedings of the 2018 on Great Lakes Symposium on VLSI\",\"volume\":\"18 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2018-05-30\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"6\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 2018 on Great Lakes Symposium on VLSI\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3194554.3194614\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2018 on Great Lakes Symposium on VLSI","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3194554.3194614","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 6

Abstract

Neural networks have been successfully deployed in a variety of fields such as computer vision, natural language processing, and pattern recognition. However, most of their current deployments target cloud-based high-performance computing systems. As the computation of neural networks is not well suited to traditional von Neumann CPU architectures, many novel hardware accelerator designs have been proposed in the literature. In this paper, we present the design of a novel, simplified, and extensible neural inference engine for IoT systems. We present a detailed analysis of the impact of various design choices, such as technology node and computation block size, on the overall performance of the neural inference engine. The paper demonstrates the first design instance of a power-optimized ELM neural network using ReLU activation. A comparison of the learning performance of the simulated hardware against the software model of the neural network shows a variation of ~1% in testing accuracy due to quantization. The accelerator compute blocks achieve a performance per watt of ~290 MSPS/W (million samples per second per watt) with a network structure of size 8 x 32 x 2. A minimum energy of 40 pJ per processed sample is achieved for a block size of 16. Further, we show through simulations that an additional power saving of ~30% can be achieved if SRAM-based main memory is replaced with emerging STT-MRAM technology.
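The core technique named in the abstract, an ELM (extreme learning machine, as the acronym is standard in this literature) with ReLU activation whose weights are quantized for hardware, can be sketched briefly. The snippet below is a minimal illustration and not the authors' implementation: it builds an 8 x 32 x 2 ELM on a synthetic dataset and compares floating-point test accuracy against accuracy after uniform fixed-point weight quantization, the kind of comparison behind the reported ~1% variation. The toy dataset, the 8-bit width, and all helper names are assumptions made for illustration.

```python
# Minimal ELM sketch (assumed setup, not the paper's implementation):
# 8 inputs, 32 random hidden units with ReLU, 2 outputs, plus a simple
# uniform weight quantizer to show the accuracy impact of quantization.
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def quantize(w, bits=8):
    # Symmetric uniform quantization to a signed fixed-point grid
    # (bit width is an illustrative assumption).
    scale = np.max(np.abs(w)) / (2 ** (bits - 1) - 1)
    return np.round(w / scale) * scale

# Toy 2-class dataset with 8 input features (illustrative only).
X = rng.standard_normal((1000, 8))
y = (X[:, :2].sum(axis=1) > 0).astype(int)
T = np.eye(2)[y]                      # one-hot targets for the 2 outputs
X_tr, X_te, T_tr, y_te = X[:800], X[800:], T[:800], y[800:]

# ELM: hidden weights are random and fixed; only the output layer is
# learned, via a closed-form least-squares fit (pseudo-inverse).
W_h = rng.standard_normal((8, 32))    # 8 x 32 hidden layer
b_h = rng.standard_normal(32)
H = relu(X_tr @ W_h + b_h)
W_o = np.linalg.pinv(H) @ T_tr        # 32 x 2 output weights

def accuracy(W_h, b_h, W_o):
    pred = relu(X_te @ W_h + b_h) @ W_o
    return (pred.argmax(axis=1) == y_te).mean()

print("float accuracy:    ", accuracy(W_h, b_h, W_o))
print("quantized accuracy:", accuracy(quantize(W_h), quantize(b_h), quantize(W_o)))
```

As a unit sanity check on the reported figures: performance per watt and energy per sample are reciprocals, so ~290 MSPS/W corresponds to roughly 1/(290 x 10^6) J, or about 3.4 nJ per sample, while the separately reported minimum of 40 pJ per sample applies to the block-size-16 configuration, a more energy-efficient operating point.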