Zhiwei Lin, Yubin Qin, Jiachen Wang, Yang Wang, Huanyu Wang, Zhe Zheng, Wenpeng Cui, Shaojun Wei, Yang Hu, Shouyi Yin
Title: BLADE: Energy-efficient attention accelerator with fused kernel and bit-level redundancy elimination
DOI: 10.1049/ell2.70137
Journal: Electronics Letters, vol. 61, issue 1 (published 2025-02-20; JCR Q4, Engineering, Electrical & Electronic)
Open-access PDF: https://onlinelibrary.wiley.com/doi/epdf/10.1049/ell2.70137
Citations: 0
Abstract
Attention-based transformer models have achieved remarkable performance across many artificial intelligence fields, yet the attention computation, a combination of matrix multiplication and the softmax function, remains sub-optimal in hardware implementations. Computing attention normally requires three passes of input memory access, and the on-chip storage requirement grows with the input length; both pose significant memory issues. Furthermore, the computational burden is heavy for long inputs. This paper proposes an algorithm-hardware co-design for attention. On the algorithm side, it uses a linear-softmax fused kernel that fuses the matrix multiplications and the non-linear functions, enabling high utilization of on-chip memory resources. On the hardware side, it presents an accelerator named BLADE with identical-partial-product removal, which eliminates unnecessary computation by exploiting mathematical properties of softmax. Experiments on ViT, Swin Transformer, GPT-2, and LLaMA2 show that the proposed design achieves a 10.6–18.7% energy-efficiency improvement over state-of-the-art FlashAttention implementations.
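To make the fusion idea concrete, the sketch below contrasts the naive three-pass attention the abstract describes with a single streaming pass using the online-softmax recurrence (the same trick underlying FlashAttention-style kernels). This is a generic illustration, not the paper's BLADE kernel: the function names and the per-key streaming granularity are assumptions for clarity.

```python
import numpy as np

def attention_three_pass(q, k, v):
    """Naive attention: three separate passes over the score matrix
    (compute scores, normalize with softmax, then weighted sum)."""
    s = q @ k.T                                   # pass 1: all scores
    p = np.exp(s - s.max(axis=-1, keepdims=True))
    p /= p.sum(axis=-1, keepdims=True)            # pass 2: softmax
    return p @ v                                  # pass 3: weighted sum

def attention_fused(q, k, v):
    """Fused attention via online softmax: one streaming pass over the
    keys/values, keeping only a running max, running normalizer, and
    running (unnormalized) output -- on-chip state is independent of
    the input length."""
    n_q = q.shape[0]
    out = np.zeros((n_q, v.shape[1]))
    m = np.full(n_q, -np.inf)                     # running row-wise max
    l = np.zeros(n_q)                             # running normalizer
    for j in range(k.shape[0]):                   # stream one key/value at a time
        s = q @ k[j]                              # scores for key j, shape (n_q,)
        m_new = np.maximum(m, s)
        scale = np.exp(m - m_new)                 # rescale old state to new max
        p = np.exp(s - m_new)
        out = out * scale[:, None] + p[:, None] * v[j]
        l = l * scale + p
        m = m_new
    return out / l[:, None]                       # normalize once at the end
```

Both routines compute the same result; the fused version simply reorders the arithmetic so the full score matrix never has to be materialized, which is what decouples on-chip storage from sequence length.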
About the Journal
Electronics Letters is an internationally renowned peer-reviewed rapid-communication journal that publishes short original research papers every two weeks. Its broad and interdisciplinary scope covers the latest developments in all electronic engineering related fields including communication, biomedical, optical and device technologies. Electronics Letters also provides further insight into some of the latest developments through special features and interviews.
Scope
As a journal at the forefront of its field, Electronics Letters publishes papers covering all themes of electronic and electrical engineering. The major themes of the journal are listed below.
Antennas and Propagation
Biomedical and Bioinspired Technologies, Signal Processing and Applications
Control Engineering
Electromagnetism: Theory, Materials and Devices
Electronic Circuits and Systems
Image, Video and Vision Processing and Applications
Information, Computing and Communications
Instrumentation and Measurement
Microwave Technology
Optical Communications
Photonics and Opto-Electronics
Power Electronics, Energy and Sustainability
Radar, Sonar and Navigation
Semiconductor Technology
Signal Processing
MIMO