An On-Chip Fully Connected Neural Network Training Hardware Accelerator Based on Brain Float Point and Sparsity Awareness

IF 2.4 Q2 ENGINEERING, ELECTRICAL & ELECTRONIC

IEEE Open Journal of Circuits and Systems, Pub Date: 2023-01-01, DOI: 10.1109/OJCAS.2023.3245061

Tsung-Han Tsai;Ding-Bang Lin

Volume 4, pages 85-98. Full text: https://ieeexplore.ieee.org/document/10051716/ (PDF: https://ieeexplore.ieee.org/iel7/8784029/10019301/10051716.pdf)

Citations: 4

Abstract

In recent years, deep neural networks (DNNs) have brought revolutionary progress to many fields. They are widely used in image pre-processing, image enhancement, face recognition, voice recognition, and other applications, gradually replacing traditional algorithms, and their rise has reshaped artificial intelligence. Because neural network algorithms are computationally intensive, they require GPUs or dedicated accelerator hardware for real-time computation. However, the high cost and high power consumption of GPUs result in low energy efficiency, which has recently motivated much research on digital hardware accelerators for deep neural networks. In this paper, we propose an efficient and flexible neural network training processor for fully connected layers. The proposed training processor features low power consumption, high throughput, and high energy efficiency. It exploits the sparsity of neuron activations to reduce both the number of memory accesses and the memory space required. It also uses a novel reconfigurable computing architecture to maintain high performance in both forward propagation and backward propagation. The processor is implemented on a Xilinx Zynq UltraScale+ MPSoC ZCU104 FPGA, operates at 200 MHz, consumes 6.444 W, and achieves 102.43 GOPS.
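The two ideas named in the title and abstract, brain floating point (bfloat16) arithmetic and activation sparsity, can be illustrated with a small software sketch. This is not the authors' hardware design: the helper names (`to_bfloat16`, `compress`, `forward`, `weight_grad`), the round-to-nearest-even rounding mode, the ReLU activation, and the (index, value) storage format are all assumptions made for illustration. The point is only that zero activations contribute nothing to either the forward product or the weight gradient, so they need not be stored or fetched.

```python
import struct


def to_bfloat16(x):
    """Round a float to the nearest bfloat16 value (assumed round-to-nearest-even).

    bfloat16 keeps float32's sign bit and 8 exponent bits but only 7 mantissa
    bits, i.e. the upper 16 bits of the float32 encoding. NaN handling and
    values outside the float32 range are not covered in this sketch.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # rounding bias, ties go to even
    bits &= 0xFFFF0000                   # drop the low 16 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def compress(activations):
    """Keep only non-zero activations as (index, value) pairs (hypothetical format)."""
    return [(j, a) for j, a in enumerate(activations) if a != 0.0]


def forward(weights, sparse_x):
    """Compute ReLU(W x), reading only the weights that meet a non-zero input."""
    out = []
    for row in weights:
        acc = sum(row[j] * a for j, a in sparse_x)
        out.append(to_bfloat16(max(0.0, acc)))
    return out


def weight_grad(delta, sparse_x, n_in):
    """dW[i][j] = delta[i] * x[j]; columns where x[j] == 0 stay zero and are skipped."""
    grad = [[0.0] * n_in for _ in delta]
    for i, d in enumerate(delta):
        for j, a in sparse_x:
            grad[i][j] = to_bfloat16(d * a)
    return grad


# Toy layer: 4 inputs (half of them zero), 2 outputs.
x = [0.0, 1.5, 0.0, 2.0]
W = [[1.0, 2.0, 3.0, 4.0], [-1.0, -2.0, -3.0, -4.0]]
sx = compress(x)                             # 2 stored entries instead of 4
y = forward(W, sx)                           # forward propagation
dW = weight_grad([0.5, -0.25], sx, len(x))   # backward propagation (weight update term)
print(sx, y, dW)
```

In this toy case half of the inputs are zero, so both passes perform half the multiply-accumulate operations and weight reads of a dense implementation. The actual savings depend on the activation sparsity of the trained network.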

Source journal

IEEE Open Journal of Circuits and Systems

Self-citation rate

0.00%

Review time

19 weeks