通过基于梯度的学习运行时剪枝来加速注意力

Proceedings of the 49th Annual International Symposium on Computer Architecture Pub Date : 2022-04-07 DOI:10.1145/3470496.3527423

Zheng Li, Soroush Ghodrati, A. Yazdanbakhsh, H. Esmaeilzadeh, Mingu Kang

{"title":"通过基于梯度的学习运行时剪枝来加速注意力","authors":"Zheng Li, Soroush Ghodrati, A. Yazdanbakhsh, H. Esmaeilzadeh, Mingu Kang","doi":"10.1145/3470496.3527423","DOIUrl":null,"url":null,"abstract":"Self-attention is a key enabler of state-of-art accuracy for various transformer-based Natural Language Processing models. This attention mechanism calculates a correlation score for each word with respect to the other words in a sentence. Commonly, only a small subset of words highly correlates with the word under attention, which is only determined at runtime. As such, a significant amount of computation is inconsequential due to low attention scores and can potentially be pruned. The main challenge is finding the threshold for the scores below which subsequent computation will be inconsequential. Although such a threshold is discrete, this paper formulates its search through a soft differentiable regularizer integrated into the loss function of the training. This formulation piggy backs on the back-propagation training to analytically co-optimize the threshold and the weights simultaneously, striking a formally optimal balance between accuracy and computation pruning. To best utilize this mathematical innovation, we devise a bit-serial architecture, dubbed LeOPArd, for transformer language models with bit-level early termination microarchitectural mechanism. We evaluate our design across 43 back-end tasks for MemN2N, BERT, ALBERT, GPT-2, and Vision transformer models. Post-layout results show that, on average, LeOPArd yields 1.9×and 3.9×speedup and energy reduction, respectively, while keeping the average accuracy virtually intact (< 0.2% degradation).","PeriodicalId":337932,"journal":{"name":"Proceedings of the 49th Annual International Symposium on Computer Architecture","volume":"44 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-04-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"12","resultStr":"{\"title\":\"Accelerating attention through gradient-based learned runtime pruning\",\"authors\":\"Zheng Li, Soroush Ghodrati, A. Yazdanbakhsh, H. Esmaeilzadeh, Mingu Kang\",\"doi\":\"10.1145/3470496.3527423\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Self-attention is a key enabler of state-of-art accuracy for various transformer-based Natural Language Processing models. This attention mechanism calculates a correlation score for each word with respect to the other words in a sentence. Commonly, only a small subset of words highly correlates with the word under attention, which is only determined at runtime. As such, a significant amount of computation is inconsequential due to low attention scores and can potentially be pruned. The main challenge is finding the threshold for the scores below which subsequent computation will be inconsequential. Although such a threshold is discrete, this paper formulates its search through a soft differentiable regularizer integrated into the loss function of the training. This formulation piggy backs on the back-propagation training to analytically co-optimize the threshold and the weights simultaneously, striking a formally optimal balance between accuracy and computation pruning. To best utilize this mathematical innovation, we devise a bit-serial architecture, dubbed LeOPArd, for transformer language models with bit-level early termination microarchitectural mechanism. We evaluate our design across 43 back-end tasks for MemN2N, BERT, ALBERT, GPT-2, and Vision transformer models. Post-layout results show that, on average, LeOPArd yields 1.9×and 3.9×speedup and energy reduction, respectively, while keeping the average accuracy virtually intact (< 0.2% degradation).\",\"PeriodicalId\":337932,\"journal\":{\"name\":\"Proceedings of the 49th Annual International Symposium on Computer Architecture\",\"volume\":\"44 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-04-07\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"12\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 49th Annual International Symposium on Computer Architecture\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3470496.3527423\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 49th Annual International Symposium on Computer Architecture","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3470496.3527423","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 12

摘要

自关注是各种基于变换的自然语言处理模型的最先进的准确性的关键促成因素。这种注意力机制计算出句子中每个单词与其他单词的相关分数。通常，只有一小部分单词与所关注的单词高度相关，这只能在运行时确定。因此，由于注意力得分低，大量的计算是无关紧要的，并且可能被修剪。主要的挑战是找到分数的阈值，低于这个阈值，后续的计算就不重要了。虽然这样的阈值是离散的，但本文通过一个集成到训练损失函数中的软可微正则器来表达它的搜索。该公式依赖于反向传播训练，同时对阈值和权值进行分析共优化，在精度和计算修剪之间取得了形式上的最优平衡。为了最好地利用这一数学创新，我们设计了一个位串行架构，称为LeOPArd，用于具有位级早期终止微架构机制的转换器语言模型。我们评估了MemN2N、BERT、ALBERT、GPT-2和Vision变压器模型的43个后端任务。布局后的结果表明，平均而言，LeOPArd分别产生1.9×and 3.9×speedup和能量降低，同时保持平均精度几乎不变(< 0.2%的下降)。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Accelerating attention through gradient-based learned runtime pruning

Self-attention is a key enabler of state-of-art accuracy for various transformer-based Natural Language Processing models. This attention mechanism calculates a correlation score for each word with respect to the other words in a sentence. Commonly, only a small subset of words highly correlates with the word under attention, which is only determined at runtime. As such, a significant amount of computation is inconsequential due to low attention scores and can potentially be pruned. The main challenge is finding the threshold for the scores below which subsequent computation will be inconsequential. Although such a threshold is discrete, this paper formulates its search through a soft differentiable regularizer integrated into the loss function of the training. This formulation piggy backs on the back-propagation training to analytically co-optimize the threshold and the weights simultaneously, striking a formally optimal balance between accuracy and computation pruning. To best utilize this mathematical innovation, we devise a bit-serial architecture, dubbed LeOPArd, for transformer language models with bit-level early termination microarchitectural mechanism. We evaluate our design across 43 back-end tasks for MemN2N, BERT, ALBERT, GPT-2, and Vision transformer models. Post-layout results show that, on average, LeOPArd yields 1.9×and 3.9×speedup and energy reduction, respectively, while keeping the average accuracy virtually intact (< 0.2% degradation).

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Proceedings of the 49th Annual International Symposium on Computer Architecture

自引率

0.00%

发文量