ADCIM: scalable construction of approximate digital compute-in-memory MACRO for energy-efficient attention computation

Xu Zhang, Yuan Cheng, Dingyang Zou, Ke Gu, Meiqi Wang, Zhongfeng Wang

Journal of Systems Architecture, Vol. 167, Article 103512, published 2025-07-05. DOI: 10.1016/j.sysarc.2025.103512
Digital compute-in-memory (DCIM) performs energy-efficient computation without accuracy loss and has proven a promising way to break the memory wall common in Transformer accelerators built on the von Neumann architecture. Approximate computing is also widely used to boost computation efficiency by exploiting the error tolerance of neural networks. In this paper, we perform algorithm-hardware co-optimization to incorporate approximate multiplication into the original full-precision DCIM, yielding a more energy-efficient computing paradigm. First, a coarse-grained error compensation method is proposed to balance the errors of partial-product generation and partial-product reduction, achieving an almost zero mean error in multiplication operations. Second, a fine-grained error compensation is developed for accumulation operations, further suppressing the multiply-and-accumulate error by 2-3 orders of magnitude. Additionally, based on the proposed approximate algorithm design, the structure of the Static Random Access Memory (SRAM) cell is fully exploited to implement an efficient approximate digital compute-in-memory (ADCIM) that scales to different bit-widths. Finally, a value-adaptive error controller matches the error tolerance of the self-attention mechanism and enhances computation efficiency. The proposed ADCIM has been verified on Transformer models with different quantization precisions, achieving peak energy efficiencies of 14.91 tera-operations per second per watt (TOPS/W) at 16-bit, 22.84 TOPS/W at 12-bit, and 39.89 TOPS/W at 8-bit, with negligible accuracy loss.
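The coarse-grained compensation idea described in the abstract can be illustrated with a minimal sketch: an approximate multiplier that discards its low-order result bits (standing in for dropped partial-product columns) and adds back a single precalibrated constant equal to the mean discarded value. The truncation width, calibration procedure, and 8-bit operand range here are illustrative assumptions, not the paper's exact scheme.

```python
import random

random.seed(0)

TRUNC = 4  # number of low result bits discarded (assumed width, for illustration)

def truncate(a, b, trunc=TRUNC):
    # Keep only the high part of the product, analogous to dropping
    # the lowest partial-product columns in a truncated multiplier.
    return ((a * b) >> trunc) << trunc

# Coarse-grained compensation: precalibrate one constant as the mean
# value lost to truncation over random 8-bit operand pairs.
calib = [(random.randrange(256), random.randrange(256)) for _ in range(20000)]
comp = round(sum(a * b - truncate(a, b) for a, b in calib) / len(calib))

def approx_mul(a, b):
    # Truncated product plus the fixed compensation constant.
    return truncate(a, b) + comp

# Measure the signed mean error with and without compensation.
test = [(random.randrange(256), random.randrange(256)) for _ in range(20000)]
raw_bias = sum(truncate(a, b) - a * b for a, b in test) / len(test)
comp_bias = sum(approx_mul(a, b) - a * b for a, b in test) / len(test)
print(f"mean error without compensation: {raw_bias:.3f}")
print(f"mean error with compensation:    {comp_bias:.3f}")
```

Because the constant cancels the truncation's systematic bias, the compensated mean error lands near zero even though individual products remain approximate; this is why errors largely cancel rather than accumulate across the long dot products of attention.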
About the journal:
The Journal of Systems Architecture: Embedded Software Design (JSA) is a journal covering all design and architectural aspects related to embedded systems and software. It ranges from the microarchitecture level via the system software level up to the application-specific architecture level. Aspects such as real-time systems, operating systems, FPGA programming, programming languages, communications (limited to analysis and the software stack), mobile systems, parallel and distributed architectures as well as additional subjects in the computer and system architecture area will fall within the scope of this journal. Technology will not be a main focus, but its use and relevance to particular designs will be. Case studies are welcome but must contribute more than just a design for a particular piece of software.
Design automation of such systems, including methodologies, techniques, and tools for their design, as well as novel designs of software components, falls within the scope of this journal. Novel applications that use embedded systems are also central to this journal. While hardware is not a part of this journal, hardware/software co-design methods that consider the interplay between software and hardware components, with an emphasis on software, are also relevant here.