ADCIM: scalable construction of approximate digital compute-in-memory MACRO for energy-efficient attention computation

Xu Zhang, Yuan Cheng, Dingyang Zou, Ke Gu, Meiqi Wang, Zhongfeng Wang

Journal of Systems Architecture, Vol. 167, Article 103512, published 2025-07-05. DOI: 10.1016/j.sysarc.2025.103512
Digital compute-in-memory (DCIM) performs energy-efficient computation without accuracy loss and has proven a promising way to break the memory wall common in Transformer accelerators built on the von Neumann architecture. Approximate computing is also widely used to boost computation efficiency by exploiting the error tolerance of neural networks. In this paper, we perform algorithm-hardware co-optimization to incorporate approximate multiplication into the original full-precision DCIM, yielding a more energy-efficient computing paradigm. First, a coarse-grained error compensation method is proposed to balance the errors of partial-product generation and partial-product reduction, achieving an almost zero mean error in multiplication operations. Second, a fine-grained error compensation is developed for accumulation operations, further suppressing the multiply-and-accumulate error by 2-3 orders of magnitude. Additionally, based on the proposed approximate algorithm design, the structure of the Static Random Access Memory (SRAM) cell is fully exploited to implement an efficient approximate digital compute-in-memory (ADCIM) that scales to different bit-widths. Finally, a value-adaptive error controller matches the error tolerance of the self-attention mechanism and enhances computation efficiency. The proposed ADCIM has been verified on Transformer models with different quantization precisions, achieving peak energy efficiencies of 14.91 tera-operations per second per watt (TOPS/W) at 16-bit, 22.84 TOPS/W at 12-bit, and 39.89 TOPS/W at 8-bit, with negligible accuracy loss.
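The coarse-grained compensation idea described in the abstract can be illustrated with a minimal sketch: an approximate multiplier that discards its low-order result bits (standing in for dropped partial-product columns) and adds back a single precalibrated constant equal to the mean discarded value. The truncation width, calibration procedure, and 8-bit operand range here are illustrative assumptions, not the paper's exact scheme.

```python
import random

random.seed(0)

TRUNC = 4  # number of low result bits discarded (assumed width, for illustration)

def truncate(a, b, trunc=TRUNC):
    # Keep only the high part of the product, analogous to dropping
    # the lowest partial-product columns in a truncated multiplier.
    return ((a * b) >> trunc) << trunc

# Coarse-grained compensation: precalibrate one constant as the mean
# value lost to truncation over random 8-bit operand pairs.
calib = [(random.randrange(256), random.randrange(256)) for _ in range(20000)]
comp = round(sum(a * b - truncate(a, b) for a, b in calib) / len(calib))

def approx_mul(a, b):
    # Truncated product plus the fixed compensation constant.
    return truncate(a, b) + comp

# Measure the signed mean error with and without compensation.
test = [(random.randrange(256), random.randrange(256)) for _ in range(20000)]
raw_bias = sum(truncate(a, b) - a * b for a, b in test) / len(test)
comp_bias = sum(approx_mul(a, b) - a * b for a, b in test) / len(test)
print(f"mean error without compensation: {raw_bias:.3f}")
print(f"mean error with compensation:    {comp_bias:.3f}")
```

Because the constant cancels the truncation's systematic bias, the compensated mean error lands near zero even though individual products remain approximate; this is why errors largely cancel rather than accumulate across the long dot products of attention.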
About the journal:
The Journal of Systems Architecture: Embedded Software Design (JSA) is a journal covering all design and architectural aspects related to embedded systems and software. It ranges from the microarchitecture level via the system software level up to the application-specific architecture level. Aspects such as real-time systems, operating systems, FPGA programming, programming languages, communications (limited to analysis and the software stack), mobile systems, parallel and distributed architectures as well as additional subjects in the computer and system architecture area will fall within the scope of this journal. Technology will not be a main focus, but its use and relevance to particular designs will be. Case studies are welcome but must contribute more than just a design for a particular piece of software.
Design automation of such systems, including methodologies, techniques, and tools for their design, as well as novel designs of software components, falls within the scope of this journal. Novel applications that use embedded systems are also central to this journal. While hardware is not a part of this journal, hardware/software co-design methods that consider the interplay between software and hardware components, with an emphasis on software, are also relevant here.