MASL-AFU: A High Memory Access Efficiency 2-D Scalable LUT-Based Activation Function Unit for On-Device DNN Training

IF 2.8 | Region 2, Engineering & Technology | Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

IEEE Transactions on Very Large Scale Integration (VLSI) Systems | Pub Date: 2024-11-25 | DOI: 10.1109/TVLSI.2024.3488782

Zhaoteng Meng;Lin Shu;Jianing Zeng;Zhan Li;Kailin Lv;Haoyue Yang;Jie Hao

Vol. 33, No. 3, pp. 707-719 | Journal Article | https://ieeexplore.ieee.org/document/10766892/

Citations: 0

Abstract


MASL-AFU: A High Memory Access Efficiency 2-D Scalable LUT-Based Activation Function Unit for On-Device DNN Training

On-device deep neural network (DNN) training faces constraints in storage capacity and energy supply. Existing works primarily focus on optimizing the training of convolutional and batch normalization (BN) layers to improve the compute-to-communication (CTC) ratio and reduce the energy cost of off-chip memory access (MA). However, the training of activation layers remains challenging due to the additional off-chip MA required for derivative calculations. This article proposes MASL-AFU, an architecture designed to accelerate the activation layer in on-device DNN training. MASL-AFU leverages nonuniform piecewise linear (NUPWL) functions to speed up the forward propagation (FP) in the activation layer. During the error propagation (EP) process, retrieving derivatives from a lookup table (LUT) eliminates the need for redundant retrieval of the input data used in FP. By storing LUT indices instead of the original activation inputs, MASL-AFU significantly reduces and accelerates MA. Compared to other activation function units, MASL-AFU offers up to a $5.8\times$ increase in computational and off-chip MA efficiency. In addition, MASL-AFU incorporates two dimensions of scalability: data precision and the number of LUT entries. These scalable, hardware-friendly methods enhance MASL-AFU’s area efficiency by up to $3.24\times$ and energy efficiency by up to $3.85\times$.
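The idea in the abstract (NUPWL approximation in FP, derivative lookup by stored LUT index in EP) can be illustrated with a small software sketch. This is not the authors' hardware design: the choice of tanh, the breakpoint positions, the chord-based segment fit, the use of the segment slope as the stored derivative, and all function names (`build_lut`, `forward`, `backward`) are assumptions made purely for illustration.

```python
import numpy as np

# Hypothetical nonuniform breakpoints: denser near 0, where tanh curves most.
BREAKPOINTS = np.array([-8, -4, -2, -1, -0.5, 0, 0.5, 1, 2, 4, 8], dtype=np.float64)

def build_lut(f, breakpoints):
    """Fit one line per segment through the segment's two endpoints (chord fit)."""
    x0, x1 = breakpoints[:-1], breakpoints[1:]
    slopes = (f(x1) - f(x0)) / (x1 - x0)
    intercepts = f(x0) - slopes * x0
    return slopes, intercepts

def forward(x, breakpoints, slopes, intercepts):
    """FP: find the segment index of each input, then evaluate slope * x + intercept.

    Returns the activation output and the compact index array; the index is
    what would be kept for EP in place of the original input.
    """
    idx = np.searchsorted(breakpoints, x, side="right") - 1
    idx = np.clip(idx, 0, len(slopes) - 1)  # out-of-range inputs use the end segments
    y = slopes[idx] * x + intercepts[idx]
    return y, idx.astype(np.uint8)

def backward(grad_out, idx, slopes):
    """EP: the derivative is looked up by index; the original input x is not needed."""
    return grad_out * slopes[idx]

SLOPES, INTERCEPTS = build_lut(np.tanh, BREAKPOINTS)

x = np.linspace(-6.0, 6.0, 1001).astype(np.float32)
y, idx = forward(x, BREAKPOINTS, SLOPES, INTERCEPTS)
grad_in = backward(np.ones_like(y), idx, SLOPES)

print("max |y - tanh(x)|:", np.max(np.abs(y - np.tanh(x))))
print("bytes kept for EP (indices):", idx.nbytes, "vs. raw float32 inputs:", x.nbytes)
```

In this sketch the ten segments would fit in a 4-bit index; `uint8` is used only because it is the smallest convenient NumPy integer type. The two knobs the abstract names, data precision and number of LUT entries, correspond loosely here to the array dtypes and the length of `BREAKPOINTS`.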


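Source journal

IEEE Transactions on Very Large Scale Integration (VLSI) Systems (Engineering & Technology - Engineering: Electrical & Electronic)

CiteScore: 6.40

Self-citation rate: 7.10%

Articles published: 187

Review time: 3.6 months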

About the journal: The IEEE Transactions on VLSI Systems is published as a monthly journal under the co-sponsorship of the IEEE Circuits and Systems Society, the IEEE Computer Society, and the IEEE Solid-State Circuits Society. Design and realization of microelectronic systems using VLSI/ULSI technologies require close collaboration among scientists and engineers in the fields of systems architecture, logic and circuit design, chip and wafer fabrication, packaging, testing, and systems applications. Generation of specifications, design, and verification must be performed at all abstraction levels, including the system, register-transfer, logic, circuit, transistor, and process levels. The IEEE Transactions on VLSI Systems was founded to address this critical area through a common forum. The editorial board, consisting of international experts, invites original papers that emphasize the novel systems integration aspects of microelectronic systems, including interactions among systems design and partitioning, logic and memory design, digital and analog circuit design, layout synthesis, CAD tools, chip and wafer fabrication, testing and packaging, and systems-level qualification. Thus, the coverage of these Transactions will focus on VLSI/ULSI microelectronic systems integration.