DTQAtten: Leveraging Dynamic Token-based Quantization for Efficient Attention Architecture

Tao Yang, Dongyue Li, Zhuoran Song, Yilong Zhao, Fangxin Liu, Zongwu Wang, Zhezhi He, Li Jiang
{"title":"DTQAtten: Leveraging Dynamic Token-based Quantization for Efficient Attention Architecture","authors":"Tao Yang, Dongyue Li, Zhuoran Song, Yilong Zhao, Fangxin Liu, Zongwu Wang, Zhezhi He, Li Jiang","doi":"10.23919/DATE54114.2022.9774692","DOIUrl":null,"url":null,"abstract":"Models based on the attention mechanism, i.e. transformers, have shown extraordinary performance in Natural Language Processing (NLP) tasks. However, their memory footprint, inference latency, and power consumption are still prohibitive for efficient inference at edge devices, even at data centers. To tackle this issue, we present an algorithm-architecture co-design with dynamic and mixed-precision quantization, DTQAtten. We present empirically that the tolerance to the noise varies from token to token in attention-based NLP models. This finding leads us to quantize different tokens with mixed levels of bits. Thus, we design a compression framework that (i) dynamically quantizes tokens while they are forwarded in the models and (ii) jointly determines the ratio of each precision. Moreover, due to the dynamic mixed-precision tokens caused by our framework, previous matrix-multiplication accelerators (e.g. systolic array) cannot effectively exploit the benefit of the compressed attention computation. We thus design our accelerator with the variable-speed systolic array (VSSA) and propose an effective optimization strategy to alleviate the pipeline-stall problem in VSSA without hardware overhead. We conduct experiments with existing attention-based NLP models, including BERT and GPT-2 on various language tasks. Our results show that DTQAtten outperforms the previous neural network accelerator Eyeriss by 13.12× in terms of speedup and 3.8× in terms of energy-saving. Compared with the state-of-the-art attention accelerator SpAtten, our DTQAtten achieves at least 2.65× speedup and 3.38× energy efficiency improvement.","PeriodicalId":232583,"journal":{"name":"2022 Design, Automation & Test in Europe Conference & Exhibition (DATE)","volume":"4 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-03-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"4","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 Design, Automation & Test in Europe Conference & Exhibition (DATE)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.23919/DATE54114.2022.9774692","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 4

Abstract

Models based on the attention mechanism, i.e., Transformers, have shown extraordinary performance on Natural Language Processing (NLP) tasks. However, their memory footprint, inference latency, and power consumption remain prohibitive for efficient inference on edge devices, and even in data centers. To tackle this issue, we present DTQAtten, an algorithm-architecture co-design with dynamic, mixed-precision quantization. We show empirically that the tolerance to quantization noise varies from token to token in attention-based NLP models. This finding leads us to quantize different tokens with mixed bit-widths. We therefore design a compression framework that (i) dynamically quantizes tokens as they are forwarded through the model and (ii) jointly determines the ratio of tokens assigned to each precision. Moreover, because our framework produces dynamic mixed-precision tokens, previous matrix-multiplication accelerators (e.g., systolic arrays) cannot effectively exploit the benefits of the compressed attention computation. We therefore design our accelerator around a variable-speed systolic array (VSSA) and propose an effective optimization strategy that alleviates the pipeline-stall problem in the VSSA without hardware overhead. We conduct experiments with existing attention-based NLP models, including BERT and GPT-2, on various language tasks. Our results show that DTQAtten outperforms the prior neural-network accelerator Eyeriss with a 13.12× speedup and 3.8× energy savings. Compared with the state-of-the-art attention accelerator SpAtten, DTQAtten achieves at least a 2.65× speedup and a 3.38× improvement in energy efficiency.
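The central ideas in the abstract, namely that per-token noise tolerance drives the choice of bit-width and that a precision ratio controls how many tokens receive each bit-width, can be illustrated with a short sketch. The Python snippet below is not the paper's algorithm: the sensitivity proxy (per-token activation range), the bit-width set {4, 8}, the low_bit_ratio knob, and all function names are hypothetical and introduced only for illustration.

# A minimal, hypothetical sketch of dynamic token-wise mixed-precision
# quantization in the spirit of the abstract above. The sensitivity proxy
# and the {4, 8}-bit precision set are assumptions, not the paper's method.
import numpy as np

def quantize_token(row: np.ndarray, bits: int) -> np.ndarray:
    """Symmetric uniform quantization of one token's activation vector."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(row)) / qmax + 1e-12
    q = np.clip(np.round(row / scale), -qmax - 1, qmax)
    return q * scale  # dequantized values, kept only to inspect the error

def dynamic_token_quantize(x: np.ndarray, low_bit_ratio: float = 0.5):
    """
    x: (num_tokens, hidden_dim) activations entering an attention layer.
    low_bit_ratio: fraction of tokens demoted to the lower precision. This
    scalar knob stands in for the "ratio of each precision" that DTQAtten
    determines jointly; here it is simply a user-chosen parameter.
    """
    # Illustrative heuristic: tokens with a small dynamic range are assumed
    # to tolerate coarser quantization.
    sensitivity = np.max(np.abs(x), axis=1)
    order = np.argsort(sensitivity)            # least sensitive tokens first
    n_low = int(low_bit_ratio * x.shape[0])

    bits = np.full(x.shape[0], 8, dtype=int)   # default: 8-bit tokens
    bits[order[:n_low]] = 4                    # demote tolerant tokens to 4-bit

    out = np.stack([quantize_token(x[i], bits[i]) for i in range(x.shape[0])])
    return out, bits

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    acts = rng.normal(size=(16, 64)).astype(np.float32)
    deq, bit_map = dynamic_token_quantize(acts, low_bit_ratio=0.5)
    print("bit-width per token:", bit_map)
    print("mean abs quantization error:", float(np.mean(np.abs(deq - acts))))

In a real accelerator setting, the per-token bit-widths would feed the variable-speed systolic array directly rather than being dequantized in software; the dequantization above only exposes the induced quantization error for inspection.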