ABS: Accumulation Bit-Width Scaling Method for Designing Low-Precision Tensor Core

Impact factor: 2.8 | Zone 2, Engineering and Technology | JCR Q2, Computer Science, Hardware & Architecture
Yasong Cao;Mei Wen;Zhongdi Luo;Xin Ju;Haolan Huang;Junzhong Shen;Haiyan Chen
Journal: IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 32, no. 9, pp. 1590-1601
Published: 2024-06-25
Publication type: Journal Article
DOI: 10.1109/TVLSI.2024.3414260
URL: https://ieeexplore.ieee.org/document/10571370/
Citations: 0

ABS: Accumulation Bit-Width Scaling Method for Designing Low-Precision Tensor Core
A large gap exists between the computational demand of deep neural network (DNN) applications and the computing power of DNN accelerators. Low-precision floating-point (LP-FP) computation is an important means of improving the performance of DNN training and inference. However, tensor cores (TCs) typically use high-precision accumulators to sum the dot products during general matrix multiplication (GEMM). As data precision decreases, the accumulator becomes the main consumer of area and power in the multiply-accumulate (MAC) unit, so reducing the accumulator bit-width is important for improving the area and energy efficiency of TCs. Two main challenges arise: 1) providing theoretical support for choosing the floating-point (FP) format with the lowest bit-width for the TC accumulator and 2) integrating the LP-FP TC into a DNN training and inference framework to evaluate its benefits. In this article, we propose accumulation bit-width scaling (ABS), a novel method to guide the design of LP-FP TCs. We 1) implement this method by constructing a novel variance retention ratio (VRR) model that predicts the FP format with the minimum bit-width for the TC accumulator; 2) provide a DNN accelerator generator based on a systolic-array (SA) TC that supports many low-precision configurations; and 3) design an LP-FP DNN execution framework that supports a software-simulation mode and a hardware-accelerator mode for running LP-FP DNN tasks. The experimental results show that the LP-FP TC guided by our ABS method reduces area and power consumption by up to 76.47% and 75.60%, respectively, compared with advanced TCs.
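The accumulator-precision problem described in the abstract can be illustrated with a small software simulation. The sketch below is not the paper's VRR model, which the abstract does not define. It assumes an empirical stand-in: the variance of a reduced-precision running sum divided by the variance of the exact sum over random trials. The helper names (round_to_mantissa, accumulate, empirical_vrr) are hypothetical, and only mantissa precision is modelled, with the exponent range left unbounded.

```python
import math
import random
import statistics


def round_to_mantissa(x: float, mantissa_bits: int) -> float:
    """Round x to `mantissa_bits` explicit fraction bits (ties to even).

    Assumption: the exponent range is unbounded, so only precision
    loss is modelled, not overflow or underflow.
    """
    if x == 0.0 or not math.isfinite(x):
        return x
    m, e = math.frexp(x)                 # x = m * 2**e, 0.5 <= |m| < 1
    scale = 2.0 ** (mantissa_bits + 1)   # +1 for the implicit leading bit
    return math.ldexp(round(m * scale) / scale, e)


def accumulate(products, mantissa_bits: int) -> float:
    """Sum `products` sequentially, rounding the accumulator after each add."""
    acc = 0.0
    for p in products:
        acc = round_to_mantissa(acc + p, mantissa_bits)
    return acc


def empirical_vrr(length: int, mantissa_bits: int,
                  trials: int = 200, seed: int = 0) -> float:
    """Variance of the reduced-precision sum over variance of the exact sum.

    Assumption: products are i.i.d. standard normal; real DNN
    dot-product terms will be distributed differently.
    """
    rng = random.Random(seed)
    exact, reduced = [], []
    for _ in range(trials):
        products = [rng.gauss(0.0, 1.0) for _ in range(length)]
        exact.append(math.fsum(products))
        reduced.append(accumulate(products, mantissa_bits))
    return statistics.pvariance(reduced) / statistics.pvariance(exact)


if __name__ == "__main__":
    # Compare a few accumulator widths for one accumulation length.
    for bits in (2, 5, 10, 23):
        print(bits, empirical_vrr(length=1024, mantissa_bits=bits))
```

A deterministic case shows the underlying effect: with 3 fraction bits, adding one hundred terms of 1.0 to an accumulator holding 256.0 leaves it at 256.0, because each partial sum of 257 rounds back to 256. With 23 fraction bits the same sequence reaches 356.0 exactly.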
Source journal
CiteScore: 6.40
Self-citation rate: 7.10%
Articles published: 187
Review time: 3.6 months
About the journal: The IEEE Transactions on VLSI Systems is published monthly under the co-sponsorship of the IEEE Circuits and Systems Society, the IEEE Computer Society, and the IEEE Solid-State Circuits Society. Design and realization of microelectronic systems using VLSI/ULSI technologies require close collaboration among scientists and engineers in the fields of systems architecture, logic and circuit design, chip and wafer fabrication, packaging, testing, and systems applications. Generation of specifications, design, and verification must be performed at all abstraction levels, including the system, register-transfer, logic, circuit, transistor, and process levels. The IEEE Transactions on VLSI Systems was founded to address this critical area through a common forum. The editorial board, consisting of international experts, invites original papers that emphasize the novel systems integration aspects of microelectronic systems, including interactions among systems design and partitioning, logic and memory design, digital and analog circuit design, layout synthesis, CAD tools, chip and wafer fabrication, testing and packaging, and systems-level qualification. Thus, the coverage of these Transactions focuses on VLSI/ULSI microelectronic systems integration.