A 28nm 27.5TOPS/W Approximate-Computing-Based Transformer Processor with Asymptotic Sparsity Speculating and Out-of-Order Computing

Yang Wang, Yubin Qin, Dazheng Deng, Jingchuang Wei, Yang Zhou, Yuanqi Fan, Tianbao Chen, Hao-Fen Sun, Leibo Liu, Shaojun Wei, S. Yin

2022 IEEE International Solid-State Circuits Conference (ISSCC), Feb. 2022, pp. 1-3. DOI: 10.1109/ISSCC42614.2022.9731686
Recently, Transformer-based models have achieved tremendous success in many AI fields, from NLP to CV, thanks to the attention mechanism [1]–[3]. This mechanism captures the global correlations of the input by scoring the relevance of every pair of tokens and then uses the normalized scores, defined as attention probabilities, to weight all input tokens, producing output tokens with a global receptive field. A Transformer model consists of multiple blocks built around multi-head attention. Figure 29.2.1 details the computation of an attention block with the query (Q), key (K), and value (V) matrices, which are computed from the input tokens and weight matrices. First, Q is multiplied by $\mathrm{K}^{\mathrm{T}}$ to generate the attention-score matrix. The scores in each row, denoted $\mathrm{X}_{\mathrm{i}}$, indicate one token's relevance to all other tokens. Second, a row-wise softmax with inputs $\mathrm{X}_{\mathrm{i}}-\mathrm{X}_{\max}$ normalizes the attention scores to probabilities (P), exponentially amplifying the large scores and suppressing the small ones. Finally, the probabilities are quantized and multiplied by V to produce the output. Each output token is therefore a weighted sum of all input tokens, in which strongly related tokens receive large weights. Global attention-based models achieve 20.4% higher accuracy than LSTM on NLP tasks and 15.1% higher accuracy than ResNet-152 on classification.
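As a complement to Fig. 29.2.1, the sketch below walks through this attention data flow in plain NumPy. It is a minimal illustration only: the single-head form, the matrix shapes, the function name `attention_block`, and the 8-bit uniform quantizer for P are assumptions made for clarity, not the processor's actual implementation.

```python
import numpy as np

def attention_block(Q, K, V, prob_bits=8):
    """Single-head attention data flow as described above (illustrative sketch)."""
    # Step 1: attention scores X = Q @ K^T; row X_i holds one token's
    # relevance to every other token.
    X = Q @ K.T

    # Step 2: row-wise softmax on X_i - X_max, normalizing the scores to
    # probabilities P while exponentially separating large and small scores.
    X_shift = X - X.max(axis=-1, keepdims=True)
    expX = np.exp(X_shift)
    P = expX / expX.sum(axis=-1, keepdims=True)

    # Step 3: quantize the probabilities (hypothetical uniform quantizer),
    # then weight V so each output token is a weighted sum of all input tokens.
    levels = 2 ** prob_bits - 1
    P_q = np.round(P * levels) / levels
    return P_q @ V

# Example with 4 tokens of dimension 8 (random data for illustration).
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((4, 8)) for _ in range(3))
out = attention_block(Q, K, V)  # shape: (4, 8)
```

Subtracting the row maximum before the exponential leaves the softmax result unchanged but keeps the exponentials numerically bounded, which is why the abstract describes the softmax inputs as $\mathrm{X}_{\mathrm{i}}-\mathrm{X}_{\max}$.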