{"title":"BETA: A Bit-Grained Transformer Attention Accelerator With Efficient Early Termination","authors":"Huizheng Wang;Hongbin Wang;Zhiheng Yue;Jingyao Liu;Taiquan Wei;Shaojun Wei;Yang Hu;Shouyi Yin","doi":"10.1109/TCSII.2025.3596228","DOIUrl":null,"url":null,"abstract":"Attention-based large language models (LLMs) have revolutionized the natural language processing (NLP). Despite their impressive effectiveness, the quadratic complexity of self-attention incurs heavy computational and memory burdens. Dynamic sparse attention techniques emerge as a solution, however, the introduced extra prediction stage, coupled with costly data memory access, diminishes their hardware efficiency. To address these limitations, this brief proposes BETA, a fine-grained algorithm-architecture co-design tailored for sparse attention. First, a bit-grained multi-round filter (BMF) prediction is proposed to unveil and eliminate redundant memory access hidden in the sparsity prediction stage. Second, an adaptive and lightweight max-based threshold selection (MTS) strategy is developed to work in concert with the bit-wise prediction process. Third, a bit-wise out-of-order execution (BOOE) scheme is employed to enhance hardware utilization during bit-wise prediction. Finally, an elaborate architecture is designed to translate the theoretical complexity reduction into practical performance improvement. Implementation results demonstrate that BETA achieves <inline-formula> <tex-math>$5.4\\times $ </tex-math></inline-formula>, <inline-formula> <tex-math>$6.5\\times $ </tex-math></inline-formula>, <inline-formula> <tex-math>$1.8\\times $ </tex-math></inline-formula> improvements in energy efficiency than the state-of-the-art Transformer accelerators Sanger, Spatten and SOFA, respectively, while maintaining comparable inference accuracy.","PeriodicalId":13101,"journal":{"name":"IEEE Transactions on Circuits and Systems II: Express Briefs","volume":"72 10","pages":"1433-1437"},"PeriodicalIF":4.9000,"publicationDate":"2025-08-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Circuits and Systems II: Express Briefs","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/11117182/","RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
Citations: 0
Abstract
Attention-based large language models (LLMs) have revolutionized natural language processing (NLP). Despite their impressive effectiveness, the quadratic complexity of self-attention incurs heavy computational and memory burdens. Dynamic sparse attention techniques have emerged as a solution; however, the extra prediction stage they introduce, coupled with costly data memory access, diminishes their hardware efficiency. To address these limitations, this brief proposes BETA, a fine-grained algorithm-architecture co-design tailored for sparse attention. First, a bit-grained multi-round filter (BMF) prediction is proposed to unveil and eliminate the redundant memory accesses hidden in the sparsity prediction stage. Second, an adaptive and lightweight max-based threshold selection (MTS) strategy is developed to work in concert with the bit-wise prediction process. Third, a bit-wise out-of-order execution (BOOE) scheme is employed to enhance hardware utilization during bit-wise prediction. Finally, an elaborate architecture is designed to translate the theoretical complexity reduction into practical performance improvement. Implementation results demonstrate that BETA achieves 5.4×, 6.5×, and 1.8× improvements in energy efficiency over the state-of-the-art Transformer accelerators Sanger, SpAtten, and SOFA, respectively, while maintaining comparable inference accuracy.
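To make the abstract's core idea concrete, below is a minimal NumPy sketch of how a bit-grained, multi-round score filter with a max-based threshold might operate: the key matrix is processed one bit-plane at a time (most significant bit first), and after each round keys whose partial score falls too far below the running maximum are pruned, so full-precision scoring and memory access are skipped for them. The quantization scheme, bit width, margin parameter, and all function names are assumptions for illustration only; they are not the paper's actual BMF/MTS design.

```python
# Illustrative sketch only (assumed names and parameters, not BETA's implementation).
import numpy as np

def quantize_unsigned(x, bits=4):
    """Map a real-valued array to unsigned integers with the given bit width."""
    x_min, x_max = x.min(), x.max()
    scale = (x_max - x_min) / (2**bits - 1 + 1e-12)
    return np.round((x - x_min) / (scale + 1e-12)).astype(np.int64), scale

def bit_grained_filter(q, K, bits=4, margin=0.5):
    """MSB-first, multi-round estimate of q·K^T; after each bit round, prune keys
    whose partial score drops below a max-based threshold (early termination)."""
    q_int, _ = quantize_unsigned(q, bits)
    K_int, _ = quantize_unsigned(K, bits)
    candidates = np.arange(K.shape[0])
    partial = np.zeros(K.shape[0], dtype=np.int64)
    for b in range(bits - 1, -1, -1):               # most significant bit first
        k_bit = (K_int[candidates] >> b) & 1         # current bit-plane of the kept keys
        partial[candidates] += (k_bit @ q_int) << b  # accumulate partial dot products
        threshold = margin * partial[candidates].max()
        keep = partial[candidates] >= threshold      # max-based threshold selection
        candidates = candidates[keep]                # remaining keys terminate early
    return candidates

# Toy usage: keep only the keys predicted to yield large attention scores.
rng = np.random.default_rng(0)
q, K = rng.standard_normal(64), rng.standard_normal((128, 64))
kept = bit_grained_filter(q, K)
print(f"{kept.size} of {K.shape[0]} keys survive the bit-grained filter")
```

The sketch only conveys the flavor of the approach: because most keys are rejected after the first one or two bit rounds, the bulk of the key bits never need to be fetched or multiplied, which is the source of the memory-access and energy savings the abstract attributes to the prediction stage.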
Journal Introduction:
TCAS II publishes brief papers on the theory, analysis, design, and practical implementation of circuits, and on the application of circuit techniques to systems and to signal processing. Coverage spans the whole spectrum from basic scientific theory to industrial applications. The fields of interest include:
Circuits: Analog, Digital and Mixed Signal Circuits and Systems
Nonlinear Circuits and Systems, Integrated Sensors, MEMS and Systems on Chip, Nanoscale Circuits and Systems
Optoelectronic Circuits and Systems, Power Electronics and Systems
Software for Analog-and-Logic Circuits and Systems
Control aspects of Circuits and Systems.