Ruoyang Liu;Wenxun Wang;Chen Tang;Weichen Gao;Huazhong Yang;Yongpan Liu
{"title":"张量类型和噪声强度感知精确调度扩散网络的全量化训练加速器","authors":"Ruoyang Liu;Wenxun Wang;Chen Tang;Weichen Gao;Huazhong Yang;Yongpan Liu","doi":"10.1109/TCSII.2024.3439319","DOIUrl":null,"url":null,"abstract":"Fine-grained mixed-precision fully-quantized methods have great potential to accelerate neural network training, but existing methods exhibit large accuracy loss for more complex models such as diffusion networks. This brief introduces a fully-quantized training accelerator for diffusion networks. It features a novel training framework with tensor-type- and noise-strength-aware precision scheduling to optimize bit-width allocation. The processing cluster design enables dynamical switching bit-width mappings for model weights, allows concurrent processing in 4 different bit-widths, and incorporates a gradient square sum collection unit to minimize on-chip memory access. Experimental results show up to 2.4\n<inline-formula> <tex-math>$\\times $ </tex-math></inline-formula>\n training speedup and 81% operation bit-width overhead reduction compared to existing designs, with minimal impact on image generation quality.","PeriodicalId":13101,"journal":{"name":"IEEE Transactions on Circuits and Systems II: Express Briefs","volume":"71 12","pages":"4994-4998"},"PeriodicalIF":4.0000,"publicationDate":"2024-08-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"A Fully Quantized Training Accelerator for Diffusion Network With Tensor Type & Noise Strength Aware Precision Scheduling\",\"authors\":\"Ruoyang Liu;Wenxun Wang;Chen Tang;Weichen Gao;Huazhong Yang;Yongpan Liu\",\"doi\":\"10.1109/TCSII.2024.3439319\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Fine-grained mixed-precision fully-quantized methods have great potential to accelerate neural network training, but existing methods exhibit large accuracy loss for more complex models such as diffusion networks. This brief introduces a fully-quantized training accelerator for diffusion networks. It features a novel training framework with tensor-type- and noise-strength-aware precision scheduling to optimize bit-width allocation. The processing cluster design enables dynamical switching bit-width mappings for model weights, allows concurrent processing in 4 different bit-widths, and incorporates a gradient square sum collection unit to minimize on-chip memory access. 
Experimental results show up to 2.4\\n<inline-formula> <tex-math>$\\\\times $ </tex-math></inline-formula>\\n training speedup and 81% operation bit-width overhead reduction compared to existing designs, with minimal impact on image generation quality.\",\"PeriodicalId\":13101,\"journal\":{\"name\":\"IEEE Transactions on Circuits and Systems II: Express Briefs\",\"volume\":\"71 12\",\"pages\":\"4994-4998\"},\"PeriodicalIF\":4.0000,\"publicationDate\":\"2024-08-06\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Circuits and Systems II: Express Briefs\",\"FirstCategoryId\":\"5\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10623715/\",\"RegionNum\":2,\"RegionCategory\":\"工程技术\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"ENGINEERING, ELECTRICAL & ELECTRONIC\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Circuits and Systems II: Express Briefs","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/10623715/","RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
A Fully Quantized Training Accelerator for Diffusion Network With Tensor Type & Noise Strength Aware Precision Scheduling
Fine-grained mixed-precision fully-quantized methods have great potential to accelerate neural network training, but existing methods exhibit large accuracy losses on more complex models such as diffusion networks. This brief introduces a fully-quantized training accelerator for diffusion networks. It features a novel training framework with tensor-type- and noise-strength-aware precision scheduling to optimize bit-width allocation. The processing cluster design enables dynamic switching of bit-width mappings for model weights, allows concurrent processing in 4 different bit-widths, and incorporates a gradient square sum collection unit to minimize on-chip memory access. Experimental results show up to 2.4× training speedup and an 81% reduction in operation bit-width overhead compared to existing designs, with minimal impact on image generation quality.
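The abstract describes the two key mechanisms only at a high level. The sketch below is a hypothetical software illustration of what tensor-type- and noise-strength-aware bit-width scheduling and on-the-fly gradient square sum collection could look like; the tensor taxonomy, the thresholds, and the four candidate bit-widths (4/8/12/16) are all assumptions made for illustration, not values taken from the paper.

```python
# Hypothetical sketch of the two mechanisms named in the abstract.
# All names, thresholds, and bit-width values are illustrative assumptions,
# not the paper's actual scheduling policy or hardware behavior.

BITWIDTHS = (4, 8, 12, 16)  # four concurrently supported widths (assumed values)

def schedule_bitwidth(tensor_type: str, noise_strength: float) -> int:
    """Pick a bit-width for one tensor at one diffusion timestep.

    tensor_type: 'weight', 'activation', or 'gradient' (assumed taxonomy).
    noise_strength: normalized diffusion noise level in [0, 1].
    """
    # Gradients tend to be the most quantization-sensitive tensors in
    # fully quantized training, so keep them at the widest precision.
    if tensor_type == "gradient":
        return 16
    # High-noise (early reverse-diffusion) timesteps are dominated by noise,
    # so coarser precision may be tolerable there.
    if noise_strength > 0.75:
        return 4 if tensor_type == "activation" else 8
    if noise_strength > 0.25:
        return 8
    # Low-noise timesteps shape fine image detail; use wider formats.
    return 12 if tensor_type == "activation" else 16

def collect_grad_sqsum(grad_stream):
    """Accumulate the gradient square sum in the same pass that streams
    gradients out of the compute units, so the gradient tensor never has
    to be re-read from memory for a separate reduction pass."""
    sq_sum = 0.0
    for g in grad_stream:
        sq_sum += g * g
    return sq_sum

if __name__ == "__main__":
    print(schedule_bitwidth("activation", 0.9))   # -> 4
    print(schedule_bitwidth("weight", 0.1))       # -> 16
    print(collect_grad_sqsum([0.5, -1.0, 2.0]))   # -> 5.25
```

Under these assumed thresholds, an activation at a high-noise timestep would be processed at 4 bits while gradients stay at 16 bits; in the paper's accelerator, the analogous decisions are made by the hardware precision scheduler, and the square sum is gathered by a dedicated collection unit rather than in software.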
Journal Introduction:
TCAS II publishes brief papers on the theory, analysis, design, and practical implementation of circuits, and on the application of circuit techniques to systems and to signal processing. Coverage spans the whole spectrum from basic scientific theory to industrial applications. Fields of interest include:
Circuits: Analog, Digital and Mixed Signal Circuits and Systems
Nonlinear Circuits and Systems, Integrated Sensors, MEMS and Systems on Chip, Nanoscale Circuits and Systems, Optoelectronic Circuits and Systems, Power Electronics and Systems
Software for Analog-and-Logic Circuits and Systems
Control aspects of Circuits and Systems.