A 28nm 27.5TOPS/W Approximate-Computing-Based Transformer Processor with Asymptotic Sparsity Speculating and Out-of-Order Computing

Yang Wang, Yubin Qin, Dazheng Deng, Jingchuang Wei, Yang Zhou, Yuanqi Fan, Tianbao Chen, Hao-Fen Sun, Leibo Liu, Shaojun Wei, S. Yin

2022 IEEE International Solid-State Circuits Conference (ISSCC), Feb. 2022, pp. 1-3. DOI: 10.1109/ISSCC42614.2022.9731686
Recently, Transformer-based models have achieved tremendous success in many AI fields, from NLP to CV, thanks to the attention mechanism [1]–[3]. This mechanism captures the global correlations of the input by scoring the relevance of every pair of tokens and then uses the normalized scores, defined as attention probabilities, to weight all input tokens, producing output tokens with a global receptive field. A Transformer model consists of multiple blocks built around multi-head attention. Figure 29.2.1 details the computation of an attention block with the query (Q), key (K), and value (V) matrices, which are computed from the input tokens and weight matrices. First, Q is multiplied by $\mathrm{K}^{\mathrm{T}}$ to generate the attention-score matrix. The scores in each row, denoted $\mathrm{X}_{\mathrm{i}}$, indicate one token's relevance to all other tokens. Second, a row-wise softmax with inputs $\mathrm{X}_{\mathrm{i}}-\mathrm{X}_{\max}$ normalizes the attention scores to probabilities (P), exponentially amplifying the large scores and suppressing the small ones. Finally, the probabilities are quantized and multiplied by V to produce the output. Each output token is therefore a weighted sum of all input tokens, in which strongly related tokens receive large weights. Global attention-based models achieve 20.4% higher accuracy than LSTM on NLP tasks and 15.1% higher accuracy than ResNet-152 on classification.
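As a complement to Fig. 29.2.1, the sketch below walks through this attention data flow in plain NumPy. It is a minimal illustration only: the single-head form, the matrix shapes, the function name `attention_block`, and the 8-bit uniform quantizer for P are assumptions made for clarity, not the processor's actual implementation.

```python
import numpy as np

def attention_block(Q, K, V, prob_bits=8):
    """Single-head attention data flow as described above (illustrative sketch)."""
    # Step 1: attention scores X = Q @ K^T; row X_i holds one token's
    # relevance to every other token.
    X = Q @ K.T

    # Step 2: row-wise softmax on X_i - X_max, normalizing the scores to
    # probabilities P while exponentially separating large and small scores.
    X_shift = X - X.max(axis=-1, keepdims=True)
    expX = np.exp(X_shift)
    P = expX / expX.sum(axis=-1, keepdims=True)

    # Step 3: quantize the probabilities (hypothetical uniform quantizer),
    # then weight V so each output token is a weighted sum of all input tokens.
    levels = 2 ** prob_bits - 1
    P_q = np.round(P * levels) / levels
    return P_q @ V

# Example with 4 tokens of dimension 8 (random data for illustration).
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((4, 8)) for _ in range(3))
out = attention_block(Q, K, V)  # shape: (4, 8)
```

Subtracting the row maximum before the exponential leaves the softmax result unchanged but keeps the exponentials numerically bounded, which is why the abstract describes the softmax inputs as $\mathrm{X}_{\mathrm{i}}-\mathrm{X}_{\max}$.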