Design Exploration of IoT centric Neural Inference Accelerators
V. Parmar, M. Suri
Proceedings of the 2018 on Great Lakes Symposium on VLSI (published 2018-05-30)
DOI: 10.1145/3194554.3194614
Citations: 6
Abstract
Neural networks have been successfully deployed in a variety of fields such as computer vision, natural language processing, and pattern recognition. However, most of their current deployments target cloud-based high-performance computing systems. As neural-network computation is not well suited to traditional Von Neumann CPU architectures, many novel hardware accelerator designs have been proposed in the literature. In this paper we present the design of a novel, simplified, and extensible neural inference engine for IoT systems. We present a detailed analysis of the impact of various design choices, such as technology node and computation block size, on the overall performance of the neural inference engine. The paper demonstrates the first design instance of a power-optimized ELM neural network using ReLU activation. Comparing the learning performance of the simulated hardware against the software model of the neural network shows a variation of ~1% in testing accuracy due to quantization. The accelerator compute blocks achieve a performance per watt of ~290 MSPS/W (million samples per second per watt) with a network structure of size 8 x 32 x 2. A minimum energy of 40 pJ per processed sample is achieved for a block size of 16. Further, we show through simulations that an added power saving of ~30% can be achieved if SRAM-based main memory is replaced with emerging STT-MRAM technology.
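To make the abstract's setup concrete, the sketch below shows a minimal ELM (Extreme Learning Machine) with ReLU activation at the 8 x 32 x 2 size quoted above, and compares full-precision ("software model") inference against uniformly quantized weights (a stand-in for the fixed-point hardware). The dataset, quantization scheme, and bit width are illustrative assumptions, not details from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Network structure from the abstract: 8 inputs, 32 hidden units, 2 outputs.
n_in, n_hid, n_out = 8, 32, 2

# Toy synthetic data (hypothetical; the paper's dataset is not specified here).
X = rng.standard_normal((200, n_in))
W_true = rng.standard_normal((n_in, n_out))
Y = (X @ W_true > 0).astype(float)

# ELM: hidden-layer weights are random and fixed; only the output
# weights are trained, via a single least-squares solve.
W_hid = rng.standard_normal((n_in, n_hid))
b_hid = rng.standard_normal(n_hid)
H = np.maximum(X @ W_hid + b_hid, 0.0)           # ReLU activation
W_out, *_ = np.linalg.lstsq(H, Y, rcond=None)    # output weights

def quantize(w, bits=8):
    """Uniform symmetric quantization (assumed scheme, not the paper's)."""
    scale = np.max(np.abs(w)) / (2 ** (bits - 1) - 1)
    return np.round(w / scale) * scale

def accuracy(Wh, bh, Wo):
    """Fraction of correctly thresholded outputs for given parameters."""
    pred = np.maximum(X @ Wh + bh, 0.0) @ Wo
    return np.mean((pred > 0.5) == Y)

# Float ("software model") vs quantized ("simulated hardware") accuracy.
acc_float = accuracy(W_hid, b_hid, W_out)
acc_quant = accuracy(quantize(W_hid), quantize(b_hid), quantize(W_out))
print(acc_float, acc_quant)
```

With 8-bit weights the quantized accuracy typically tracks the float model closely, which is the kind of small quantization-induced gap (~1% in the paper) the abstract reports.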