A Hybrid CAM-SRAM Processing-in-Memory Architecture With Feature Level Sparsity for Attention Mechanisms

IF 4.9 · CAS Tier 2 (Engineering & Technology) · JCR Q2 (Engineering, Electrical & Electronic)
Haiqiu Huang;Mingyu Wang;Xiaojie Li;Baiqing Zhong;Zeqi Yang;Tao Lu;Yicong Zhang;Zhiyi Yu
{"title":"A Hybrid CAM-SRAM Processing-in-Memory Architecture With Feature Level Sparsity for Attention Mechanisms","authors":"Haiqiu Huang;Mingyu Wang;Xiaojie Li;Baiqing Zhong;Zeqi Yang;Tao Lu;Yicong Zhang;Zhiyi Yu","doi":"10.1109/TCSII.2025.3590432","DOIUrl":null,"url":null,"abstract":"The attention mechanism has become increasingly popular due to its ability to capture complex dependencies, enabling models like transformers to achieve remarkable performance in large language models (LLMs), computer vision, and other domains. However, the mechanism faces challenges such as low arithmetic intensity, leading to frequent data movement, and long sequence lengths, which introduce a large amount of redundant information. To mitigate both data movement and computational overhead in attention mechanisms, we propose a hybrid CAM-SRAM processing-in-memory architecture. By leveraging the parallel search and sort capabilities of content-addressable memory (CAM) arrays, we achieve dynamic fine-grained sparsification on features with varying variance, reducing the number of multiply-accumulate (MAC) operations in the matrix multiplication (MatMul). Furthermore, an approximate booth encoding is employed in our MAC unit to reduce the number of partial products and maintain the consistency of their signs. This eliminates the need for negation operations, simplifying the logic design. Experimental results show that, in different configurations, our feature-level sparsification scheme achieves over 80% sparsity with an acceptable accuracy drop. With sparsity up to 80%, our design achieves a performance of 0.252-1.26 TOPS and a power efficiency of 4.71-21.72 TOPS/W, operating at 1000 MHz on the TSMC 40nm process.","PeriodicalId":13101,"journal":{"name":"IEEE Transactions on Circuits and Systems II: Express Briefs","volume":"72 9","pages":"1283-1287"},"PeriodicalIF":4.9000,"publicationDate":"2025-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Circuits and Systems II: Express Briefs","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/11084893/","RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
Citations: 0

Abstract

The attention mechanism has become increasingly popular due to its ability to capture complex dependencies, enabling models like transformers to achieve remarkable performance in large language models (LLMs), computer vision, and other domains. However, the mechanism faces challenges such as low arithmetic intensity, leading to frequent data movement, and long sequence lengths, which introduce a large amount of redundant information. To mitigate both data movement and computational overhead in attention mechanisms, we propose a hybrid CAM-SRAM processing-in-memory architecture. By leveraging the parallel search and sort capabilities of content-addressable memory (CAM) arrays, we achieve dynamic fine-grained sparsification on features with varying variance, reducing the number of multiply-accumulate (MAC) operations in the matrix multiplication (MatMul). Furthermore, an approximate Booth encoding is employed in our MAC unit to reduce the number of partial products and maintain the consistency of their signs. This eliminates the need for negation operations, simplifying the logic design. Experimental results show that, in different configurations, our feature-level sparsification scheme achieves over 80% sparsity with an acceptable accuracy drop. With sparsity up to 80%, our design achieves a performance of 0.252-1.26 TOPS and a power efficiency of 4.71-21.72 TOPS/W, operating at 1000 MHz on the TSMC 40nm process.
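As a rough software analogue of the sparsification idea described above, the NumPy sketch below keeps only the highest-variance feature dimensions before forming the Q·K^T attention scores, so most MAC operations in the MatMul are skipped. The function name, the keep ratio, and the variance-based ranking are illustrative assumptions standing in for the paper's CAM-based parallel search and sort, not the authors' hardware implementation.

```python
# Minimal sketch of feature-level sparsification for attention scores.
# Assumptions (not from the paper): the helper name, the keep_ratio parameter,
# and ranking features by variance across the sequence.
import numpy as np

def sparse_attention_scores(Q, K, keep_ratio=0.2):
    """Approximate Q @ K.T using only the highest-variance feature dimensions.

    Q, K: (seq_len, d_model) query/key matrices.
    keep_ratio: fraction of feature dimensions retained (0.2 ~ 80% sparsity).
    """
    d_model = Q.shape[1]
    n_keep = max(1, int(d_model * keep_ratio))

    # Rank feature dimensions by their variance across the sequence; in the
    # proposed architecture this ranking is what the CAM array's parallel
    # search/sort capability would provide on the fly.
    feature_var = K.var(axis=0)
    keep_idx = np.argsort(feature_var)[-n_keep:]

    # Only the retained dimensions contribute MAC operations to the MatMul,
    # so the MAC count drops roughly by a factor of (1 - keep_ratio).
    return Q[:, keep_idx] @ K[:, keep_idx].T

# Example: 128-token sequence, 64 feature dimensions, keep 20% of features.
rng = np.random.default_rng(0)
Q = rng.standard_normal((128, 64))
K = rng.standard_normal((128, 64))
approx = sparse_attention_scores(Q, K, keep_ratio=0.2)
exact = Q @ K.T
print("relative error:", np.linalg.norm(approx - exact) / np.linalg.norm(exact))
```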
Source Journal
IEEE Transactions on Circuits and Systems II: Express Briefs
CiteScore: 7.90
Self-citation rate: 20.50%
Annual article count: 883
Review time: 3.0 months
Journal description: TCAS II publishes brief papers in the field specified by the theory, analysis, design, and practical implementations of circuits, and the application of circuit techniques to systems and to signal processing. Included is the whole spectrum from basic scientific theory to industrial applications. The field of interest covers: analog, digital, and mixed-signal circuits and systems; nonlinear circuits and systems; integrated sensors; MEMS and systems on chip; nanoscale circuits and systems; optoelectronic circuits and systems; power electronics and systems; software for analog-and-logic circuits and systems; and control aspects of circuits and systems.