{"title":"一种高效的变压器层归一化动态量化训练模块","authors":"Haikuo Shao;Aotao Wang;Zhongfeng Wang","doi":"10.1109/TCSII.2025.3591633","DOIUrl":null,"url":null,"abstract":"Layer normalization (LN) function is widely adopted in Transformer-based neural networks. The efficient training of Transformers on personal devices is attracting attention for data privacy and latency concerns. However, the critical LN function involves extreme outliers for quantization, as well as hardware-unfriendly square-root and division operations, posing resource challenges for training deployment on the edge. This brief proposes an efficient LN training architecture with algorithm and hardware co-optimization. Specifically, we present a dynamic quantized algorithm based on integer arithmetics to smooth outliers for sufficient training accuracy. Then, we develop a reconfigurable hardware architecture to efficiently support various operations during LN training, with a vector-wise pipelined dataflow to improve hardware efficiency further. Experimental results show that our architecture achieves up to 0.25 and 1.0 Giga input per Second (GinS) in throughput at FPGA and ASIC platforms, respectively, outperforming prior works.","PeriodicalId":13101,"journal":{"name":"IEEE Transactions on Circuits and Systems II: Express Briefs","volume":"72 9","pages":"1288-1292"},"PeriodicalIF":4.9000,"publicationDate":"2025-07-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"An Efficient Layer Normalization Training Module With Dynamic Quantization for Transformers\",\"authors\":\"Haikuo Shao;Aotao Wang;Zhongfeng Wang\",\"doi\":\"10.1109/TCSII.2025.3591633\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Layer normalization (LN) function is widely adopted in Transformer-based neural networks. The efficient training of Transformers on personal devices is attracting attention for data privacy and latency concerns. However, the critical LN function involves extreme outliers for quantization, as well as hardware-unfriendly square-root and division operations, posing resource challenges for training deployment on the edge. This brief proposes an efficient LN training architecture with algorithm and hardware co-optimization. Specifically, we present a dynamic quantized algorithm based on integer arithmetics to smooth outliers for sufficient training accuracy. Then, we develop a reconfigurable hardware architecture to efficiently support various operations during LN training, with a vector-wise pipelined dataflow to improve hardware efficiency further. 
Experimental results show that our architecture achieves up to 0.25 and 1.0 Giga input per Second (GinS) in throughput at FPGA and ASIC platforms, respectively, outperforming prior works.\",\"PeriodicalId\":13101,\"journal\":{\"name\":\"IEEE Transactions on Circuits and Systems II: Express Briefs\",\"volume\":\"72 9\",\"pages\":\"1288-1292\"},\"PeriodicalIF\":4.9000,\"publicationDate\":\"2025-07-22\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Circuits and Systems II: Express Briefs\",\"FirstCategoryId\":\"5\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/11089956/\",\"RegionNum\":2,\"RegionCategory\":\"工程技术\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"ENGINEERING, ELECTRICAL & ELECTRONIC\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Circuits and Systems II: Express Briefs","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/11089956/","RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
An Efficient Layer Normalization Training Module With Dynamic Quantization for Transformers
The layer normalization (LN) function is widely adopted in Transformer-based neural networks. Efficient training of Transformers on personal devices is attracting attention due to data privacy and latency concerns. However, the critical LN function involves extreme outliers that hinder quantization, as well as hardware-unfriendly square-root and division operations, posing resource challenges for training deployment on the edge. This brief proposes an efficient LN training architecture with algorithm and hardware co-optimization. Specifically, we present a dynamic quantization algorithm based on integer arithmetic that smooths outliers while maintaining training accuracy. We then develop a reconfigurable hardware architecture that efficiently supports the various operations required during LN training, with a vector-wise pipelined dataflow to further improve hardware efficiency. Experimental results show that our architecture achieves throughputs of up to 0.25 and 1.0 Giga inputs per second (GinS) on FPGA and ASIC platforms, respectively, outperforming prior works.
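To make the general idea concrete, the following Python sketch illustrates one way an integer-arithmetic, dynamically quantized layer normalization can be organized: the quantization scale is recomputed per vector from the current maximum magnitude, so activation outliers widen the range rather than clipping against a fixed scale, and the square root is computed with an integer Newton iteration instead of a floating-point unit. This is only an illustration of the technique named in the abstract, not the brief's method: the function and parameter names (dyn_quant_layernorm, isqrt, n_bits) are hypothetical, and the actual algorithm, bit widths, and backward-pass handling are not reproduced here.

```python
import numpy as np

def isqrt(n: int) -> int:
    """Integer square root, floor(sqrt(n)), via Newton's iteration.

    Iterative integer updates like this are a common hardware-friendly
    substitute for a floating-point square root.
    """
    if n < 0:
        raise ValueError("isqrt is undefined for negative inputs")
    if n == 0:
        return 0
    x = n
    y = (x + 1) // 2
    while y < x:
        x = y
        y = (x + n // x) // 2
    return x

def dyn_quant_layernorm(x: np.ndarray, n_bits: int = 8) -> np.ndarray:
    """Illustrative integer-arithmetic layer normalization of one vector.

    The quantization scale is derived per vector from the current maximum
    magnitude (a simple form of dynamic quantization), so outliers set the
    range instead of saturating a fixed scale.
    """
    qmax = 2 ** (n_bits - 1) - 1                      # e.g., 127 for 8-bit
    max_abs = float(np.max(np.abs(x)))
    scale = max_abs / qmax if max_abs > 0.0 else 1.0  # dynamic per-vector scale
    xq = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int64)

    d = xq.size
    mean_q = int(np.sum(xq)) // d                     # integer mean
    diff = xq - mean_q
    var_q = int(np.sum(diff * diff)) // d             # integer variance
    std_q = max(isqrt(var_q), 1)                      # integer sqrt, avoid /0

    # Normalize with integer division only; a hardware datapath would keep
    # the result in a fixed-point format rather than converting to float.
    y_q = (diff * qmax) // std_q
    return y_q.astype(np.float64) / qmax

# Example: a 768-dimensional activation vector with one strong outlier,
# loosely mimicking the outlier behavior of Transformer activations.
token = np.random.randn(768)
token[0] = 40.0
normalized = dyn_quant_layernorm(token)
```

The sketch covers only the forward normalization with integer mean, variance, and square root; the brief's substantive contributions, such as the learnable affine parameters, the training (backward) path, and the reconfigurable vector-wise pipelined datapath, are outside its scope.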
Journal Introduction:
TCAS II publishes brief papers in the field specified by theory, analysis, design, and practical implementations of circuits, and the application of circuit techniques to systems and to signal processing. Coverage spans the whole spectrum from basic scientific theory to industrial applications. The fields of interest include:
Circuits: Analog, Digital and Mixed Signal Circuits and Systems
Nonlinear Circuits and Systems, Integrated Sensors, MEMS and Systems on Chip, Nanoscale Circuits and Systems, Optoelectronic Circuits and Systems, Power Electronics and Systems
Software for Analog-and-Logic Circuits and Systems
Control aspects of Circuits and Systems.