BETA: A Bit-Grained Transformer Attention Accelerator With Efficient Early Termination

IF 4.9 · CAS Zone 2 (Engineering & Technology) · JCR Q2, Engineering, Electrical & Electronic
Huizheng Wang;Hongbin Wang;Zhiheng Yue;Jingyao Liu;Taiquan Wei;Shaojun Wei;Yang Hu;Shouyi Yin
{"title":"BETA: A Bit-Grained Transformer Attention Accelerator With Efficient Early Termination","authors":"Huizheng Wang;Hongbin Wang;Zhiheng Yue;Jingyao Liu;Taiquan Wei;Shaojun Wei;Yang Hu;Shouyi Yin","doi":"10.1109/TCSII.2025.3596228","DOIUrl":null,"url":null,"abstract":"Attention-based large language models (LLMs) have revolutionized the natural language processing (NLP). Despite their impressive effectiveness, the quadratic complexity of self-attention incurs heavy computational and memory burdens. Dynamic sparse attention techniques emerge as a solution, however, the introduced extra prediction stage, coupled with costly data memory access, diminishes their hardware efficiency. To address these limitations, this brief proposes BETA, a fine-grained algorithm-architecture co-design tailored for sparse attention. First, a bit-grained multi-round filter (BMF) prediction is proposed to unveil and eliminate redundant memory access hidden in the sparsity prediction stage. Second, an adaptive and lightweight max-based threshold selection (MTS) strategy is developed to work in concert with the bit-wise prediction process. Third, a bit-wise out-of-order execution (BOOE) scheme is employed to enhance hardware utilization during bit-wise prediction. Finally, an elaborate architecture is designed to translate the theoretical complexity reduction into practical performance improvement. Implementation results demonstrate that BETA achieves <inline-formula> <tex-math>$5.4\\times $ </tex-math></inline-formula>, <inline-formula> <tex-math>$6.5\\times $ </tex-math></inline-formula>, <inline-formula> <tex-math>$1.8\\times $ </tex-math></inline-formula> improvements in energy efficiency than the state-of-the-art Transformer accelerators Sanger, Spatten and SOFA, respectively, while maintaining comparable inference accuracy.","PeriodicalId":13101,"journal":{"name":"IEEE Transactions on Circuits and Systems II: Express Briefs","volume":"72 10","pages":"1433-1437"},"PeriodicalIF":4.9000,"publicationDate":"2025-08-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Circuits and Systems II: Express Briefs","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/11117182/","RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
引用次数: 0

Abstract

Attention-based large language models (LLMs) have revolutionized natural language processing (NLP). Despite their impressive effectiveness, the quadratic complexity of self-attention incurs heavy computational and memory burdens. Dynamic sparse attention techniques have emerged as a solution; however, the extra prediction stage they introduce, coupled with costly data memory access, diminishes their hardware efficiency. To address these limitations, this brief proposes BETA, a fine-grained algorithm-architecture co-design tailored for sparse attention. First, a bit-grained multi-round filter (BMF) prediction is proposed to unveil and eliminate redundant memory access hidden in the sparsity prediction stage. Second, an adaptive and lightweight max-based threshold selection (MTS) strategy is developed to work in concert with the bit-wise prediction process. Third, a bit-wise out-of-order execution (BOOE) scheme is employed to enhance hardware utilization during bit-wise prediction. Finally, an elaborate architecture is designed to translate the theoretical complexity reduction into practical performance improvement. Implementation results demonstrate that BETA achieves 5.4×, 6.5×, and 1.8× improvements in energy efficiency over the state-of-the-art Transformer accelerators Sanger, SpAtten, and SOFA, respectively, while maintaining comparable inference accuracy.
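To make the general idea of bit-grained filtering concrete, the following is a minimal illustrative sketch, not the paper's BMF/MTS implementation: the query is consumed one bit-plane at a time (MSB first), each round refines a partial score for every key, and keys whose optimistic score bound falls below a threshold derived from the current maximum partial score are terminated early. The function name, the non-negative fixed-point quantization, and the parameters `rounds` and `alpha` are assumptions for illustration only.

```python
import numpy as np

def bit_serial_prune(q, K, num_bits=8, rounds=4, alpha=0.5):
    """Illustrative bit-serial (MSB-first) score prediction with early termination.

    Assumes q (shape (d,)) and K (shape (n, d)) hold non-negative integers
    quantized to `num_bits` bits. Returns indices of keys that survive every
    filtering round; only these would proceed to exact attention.
    """
    n = K.shape[0]
    survivors = np.arange(n)                 # keys still considered relevant
    partial = np.zeros(n)                    # score accumulated from processed bit-planes
    row_sum = K.sum(axis=1).astype(float)    # bounds the contribution of unseen bits

    for b in range(num_bits - 1, max(num_bits - 1 - rounds, -1), -1):
        if survivors.size == 0:
            break
        # One filtering round: add the contribution of query bit-plane b.
        q_bit = (q >> b) & 1
        partial[survivors] += (K[survivors] @ q_bit) * float(1 << b)

        # The remaining lower bits can add at most row_sum * (2^b - 1) per key.
        slack = row_sum[survivors] * float((1 << b) - 1)

        # Max-based threshold: a fraction of the best partial score seen so far.
        threshold = alpha * partial[survivors].max()

        # Early termination: drop keys whose optimistic bound cannot reach the threshold.
        survivors = survivors[partial[survivors] + slack >= threshold]

    return survivors

# Example usage (random 8-bit data, hypothetical sizes):
# rng = np.random.default_rng(0)
# q = rng.integers(0, 256, size=64)
# K = rng.integers(0, 256, size=(1024, 64))
# kept = bit_serial_prune(q, K)   # key indices kept for exact attention
```

Processing the most significant bits first means the earliest rounds resolve most pruning decisions, which is what makes a multi-round, progressively tightening filter attractive for cutting prediction-stage memory access.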
Source Journal
IEEE Transactions on Circuits and Systems II: Express Briefs (Engineering: Electrical & Electronic)
CiteScore: 7.90
Self-citation rate: 20.50%
Annual article count: 883
Review time: 3.0 months
Journal Description: TCAS II publishes brief papers in the field specified by the theory, analysis, design, and practical implementations of circuits, and the application of circuit techniques to systems and to signal processing. Included is the whole spectrum from basic scientific theory to industrial applications. The field of interest covered includes:
Circuits: Analog, Digital and Mixed Signal Circuits and Systems
Nonlinear Circuits and Systems, Integrated Sensors, MEMS and Systems on Chip, Nanoscale Circuits and Systems, Optoelectronic Circuits and Systems, Power Electronics and Systems
Software for Analog-and-Logic Circuits and Systems
Control aspects of Circuits and Systems