A Hybrid CAM-SRAM Processing-in-Memory Architecture With Feature Level Sparsity for Attention Mechanisms

IF 4.9 · CAS Tier 2 (Engineering & Technology) · JCR Q2 (Engineering, Electrical & Electronic)
Haiqiu Huang;Mingyu Wang;Xiaojie Li;Baiqing Zhong;Zeqi Yang;Tao Lu;Yicong Zhang;Zhiyi Yu
{"title":"A Hybrid CAM-SRAM Processing-in-Memory Architecture With Feature Level Sparsity for Attention Mechanisms","authors":"Haiqiu Huang;Mingyu Wang;Xiaojie Li;Baiqing Zhong;Zeqi Yang;Tao Lu;Yicong Zhang;Zhiyi Yu","doi":"10.1109/TCSII.2025.3590432","DOIUrl":null,"url":null,"abstract":"The attention mechanism has become increasingly popular due to its ability to capture complex dependencies, enabling models like transformers to achieve remarkable performance in large language models (LLMs), computer vision, and other domains. However, the mechanism faces challenges such as low arithmetic intensity, leading to frequent data movement, and long sequence lengths, which introduce a large amount of redundant information. To mitigate both data movement and computational overhead in attention mechanisms, we propose a hybrid CAM-SRAM processing-in-memory architecture. By leveraging the parallel search and sort capabilities of content-addressable memory (CAM) arrays, we achieve dynamic fine-grained sparsification on features with varying variance, reducing the number of multiply-accumulate (MAC) operations in the matrix multiplication (MatMul). Furthermore, an approximate booth encoding is employed in our MAC unit to reduce the number of partial products and maintain the consistency of their signs. This eliminates the need for negation operations, simplifying the logic design. Experimental results show that, in different configurations, our feature-level sparsification scheme achieves over 80% sparsity with an acceptable accuracy drop. With sparsity up to 80%, our design achieves a performance of 0.252-1.26 TOPS and a power efficiency of 4.71-21.72 TOPS/W, operating at 1000 MHz on the TSMC 40nm process.","PeriodicalId":13101,"journal":{"name":"IEEE Transactions on Circuits and Systems II: Express Briefs","volume":"72 9","pages":"1283-1287"},"PeriodicalIF":4.9000,"publicationDate":"2025-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Circuits and Systems II: Express Briefs","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/11084893/","RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
Citations: 0

Abstract

The attention mechanism has become increasingly popular due to its ability to capture complex dependencies, enabling models like transformers to achieve remarkable performance in large language models (LLMs), computer vision, and other domains. However, the mechanism faces challenges such as low arithmetic intensity, leading to frequent data movement, and long sequence lengths, which introduce a large amount of redundant information. To mitigate both data movement and computational overhead in attention mechanisms, we propose a hybrid CAM-SRAM processing-in-memory architecture. By leveraging the parallel search and sort capabilities of content-addressable memory (CAM) arrays, we achieve dynamic fine-grained sparsification on features with varying variance, reducing the number of multiply-accumulate (MAC) operations in the matrix multiplication (MatMul). Furthermore, an approximate Booth encoding is employed in our MAC unit to reduce the number of partial products and maintain the consistency of their signs. This eliminates the need for negation operations, simplifying the logic design. Experimental results show that, in different configurations, our feature-level sparsification scheme achieves over 80% sparsity with an acceptable accuracy drop. With sparsity up to 80%, our design achieves a performance of 0.252-1.26 TOPS and a power efficiency of 4.71-21.72 TOPS/W, operating at 1000 MHz on the TSMC 40nm process.
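As a rough software analogue of the sparsification idea described above, the NumPy sketch below keeps only the highest-variance feature dimensions before forming the Q·K^T attention scores, so most MAC operations in the MatMul are skipped. The function name, the keep ratio, and the variance-based ranking are illustrative assumptions standing in for the paper's CAM-based parallel search and sort, not the authors' hardware implementation.

```python
# Minimal sketch of feature-level sparsification for attention scores.
# Assumptions (not from the paper): the helper name, the keep_ratio parameter,
# and ranking features by variance across the sequence.
import numpy as np

def sparse_attention_scores(Q, K, keep_ratio=0.2):
    """Approximate Q @ K.T using only the highest-variance feature dimensions.

    Q, K: (seq_len, d_model) query/key matrices.
    keep_ratio: fraction of feature dimensions retained (0.2 ~ 80% sparsity).
    """
    d_model = Q.shape[1]
    n_keep = max(1, int(d_model * keep_ratio))

    # Rank feature dimensions by their variance across the sequence; in the
    # proposed architecture this ranking is what the CAM array's parallel
    # search/sort capability would provide on the fly.
    feature_var = K.var(axis=0)
    keep_idx = np.argsort(feature_var)[-n_keep:]

    # Only the retained dimensions contribute MAC operations to the MatMul,
    # so the MAC count drops roughly by a factor of (1 - keep_ratio).
    return Q[:, keep_idx] @ K[:, keep_idx].T

# Example: 128-token sequence, 64 feature dimensions, keep 20% of features.
rng = np.random.default_rng(0)
Q = rng.standard_normal((128, 64))
K = rng.standard_normal((128, 64))
approx = sparse_attention_scores(Q, K, keep_ratio=0.2)
exact = Q @ K.T
print("relative error:", np.linalg.norm(approx - exact) / np.linalg.norm(exact))
```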
Source Journal
IEEE Transactions on Circuits and Systems II: Express Briefs
CiteScore: 7.90
Self-citation rate: 20.50%
Annual article count: 883
Review time: 3.0 months
Journal description: TCAS II publishes brief papers in the field specified by the theory, analysis, design, and practical implementations of circuits, and the application of circuit techniques to systems and to signal processing. Included is the whole spectrum from basic scientific theory to industrial applications. The field of interest covers: analog, digital, and mixed-signal circuits and systems; nonlinear circuits and systems; integrated sensors; MEMS and systems on chip; nanoscale circuits and systems; optoelectronic circuits and systems; power electronics and systems; software for analog-and-logic circuits and systems; and control aspects of circuits and systems.