{"title":"Topkima-Former:基于Top-k内存ADC的变压器低能量、低延迟推理","authors":"Shuai Dong;Junyi Yang;Xiaoqi Peng;Hongyang Shang;Ye Ke;Xiaofeng Yang;Hongjie Liu;Arindam Basu","doi":"10.1109/TCSI.2025.3549060","DOIUrl":null,"url":null,"abstract":"Transformer has emerged as a leading architecture in neural language processing (NLP) and computer vision (CV). However, the extensive use of nonlinear operations, like softmax, poses a performance bottleneck during transformer inference and comprises up to 40% of the total latency. Hence, we propose innovations at the circuit, algorithm and architecture levels to accelerate the transformer. At the circuit level, we propose Topkima—combining top-<italic>k</i> activation selection with in-memory ADC (IMA) to implement efficient softmax without any sorting overhead. Only the <italic>k</i> largest activations are sent to softmax calculation block, reducing the huge computational cost of softmax. At the algorithmic level, a modified training scheme utilizes top-<italic>k</i> activations only during the forward pass, combined with a sub-top-<italic>k</i> method to address the crossbar size limitation by aggregating each sub-top-<italic>k</i> values as global top-<italic>k</i>. At the architecture level, we introduce a fine pipeline for efficiently scheduling data flows and an improved scale-free technique for removing scaling cost. The combined system, dubbed Topkima-Former, enhances <inline-formula> <tex-math>$1.8\\times -84\\times $ </tex-math></inline-formula> speedup and <inline-formula> <tex-math>$1.2\\times -36\\times $ </tex-math></inline-formula> energy efficiency (EE) over prior In-memory computing (IMC) accelerators. Compared to a conventional softmax macro and a digital top-<italic>k</i> (Dtopk) softmax macro, our proposed Topkima softmax macro achieves about <inline-formula> <tex-math>$15\\times $ </tex-math></inline-formula> and <inline-formula> <tex-math>$8\\times $ </tex-math></inline-formula> faster speed respectively. Experimental evaluations demonstrate minimal (0.42% to 1.60%) accuracy loss for different models in both vision and NLP tasks.","PeriodicalId":13039,"journal":{"name":"IEEE Transactions on Circuits and Systems I: Regular Papers","volume":"72 6","pages":"2509-2519"},"PeriodicalIF":5.2000,"publicationDate":"2025-03-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Topkima-Former: Low-Energy, Low-Latency Inference for Transformers Using Top-k In-Memory ADC\",\"authors\":\"Shuai Dong;Junyi Yang;Xiaoqi Peng;Hongyang Shang;Ye Ke;Xiaofeng Yang;Hongjie Liu;Arindam Basu\",\"doi\":\"10.1109/TCSI.2025.3549060\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Transformer has emerged as a leading architecture in neural language processing (NLP) and computer vision (CV). However, the extensive use of nonlinear operations, like softmax, poses a performance bottleneck during transformer inference and comprises up to 40% of the total latency. Hence, we propose innovations at the circuit, algorithm and architecture levels to accelerate the transformer. At the circuit level, we propose Topkima—combining top-<italic>k</i> activation selection with in-memory ADC (IMA) to implement efficient softmax without any sorting overhead. Only the <italic>k</i> largest activations are sent to softmax calculation block, reducing the huge computational cost of softmax. 
At the algorithmic level, a modified training scheme utilizes top-<italic>k</i> activations only during the forward pass, combined with a sub-top-<italic>k</i> method to address the crossbar size limitation by aggregating each sub-top-<italic>k</i> values as global top-<italic>k</i>. At the architecture level, we introduce a fine pipeline for efficiently scheduling data flows and an improved scale-free technique for removing scaling cost. The combined system, dubbed Topkima-Former, enhances <inline-formula> <tex-math>$1.8\\\\times -84\\\\times $ </tex-math></inline-formula> speedup and <inline-formula> <tex-math>$1.2\\\\times -36\\\\times $ </tex-math></inline-formula> energy efficiency (EE) over prior In-memory computing (IMC) accelerators. Compared to a conventional softmax macro and a digital top-<italic>k</i> (Dtopk) softmax macro, our proposed Topkima softmax macro achieves about <inline-formula> <tex-math>$15\\\\times $ </tex-math></inline-formula> and <inline-formula> <tex-math>$8\\\\times $ </tex-math></inline-formula> faster speed respectively. Experimental evaluations demonstrate minimal (0.42% to 1.60%) accuracy loss for different models in both vision and NLP tasks.\",\"PeriodicalId\":13039,\"journal\":{\"name\":\"IEEE Transactions on Circuits and Systems I: Regular Papers\",\"volume\":\"72 6\",\"pages\":\"2509-2519\"},\"PeriodicalIF\":5.2000,\"publicationDate\":\"2025-03-18\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Circuits and Systems I: Regular Papers\",\"FirstCategoryId\":\"5\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10931119/\",\"RegionNum\":1,\"RegionCategory\":\"工程技术\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"ENGINEERING, ELECTRICAL & ELECTRONIC\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Circuits and Systems I: Regular Papers","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/10931119/","RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
Topkima-Former: Low-Energy, Low-Latency Inference for Transformers Using Top-k In-Memory ADC
The transformer has emerged as a leading architecture in natural language processing (NLP) and computer vision (CV). However, its extensive use of nonlinear operations such as softmax creates a performance bottleneck during inference, accounting for up to 40% of total latency. We therefore propose innovations at the circuit, algorithm, and architecture levels to accelerate the transformer. At the circuit level, we propose Topkima, which combines top-k activation selection with an in-memory ADC (IMA) to implement an efficient softmax without any sorting overhead. Only the k largest activations are sent to the softmax calculation block, greatly reducing the computational cost of softmax. At the algorithm level, a modified training scheme uses top-k activations only during the forward pass, combined with a sub-top-k method that addresses the crossbar size limitation by aggregating the sub-top-k values from each crossbar into a global top-k. At the architecture level, we introduce a fine-grained pipeline for efficiently scheduling data flows and an improved scale-free technique that removes the scaling cost. The combined system, dubbed Topkima-Former, achieves a 1.8× to 84× speedup and 1.2× to 36× higher energy efficiency (EE) over prior in-memory computing (IMC) accelerators. Compared with a conventional softmax macro and a digital top-k (Dtopk) softmax macro, the proposed Topkima softmax macro is about 15× and 8× faster, respectively. Experimental evaluations show minimal accuracy loss (0.42% to 1.60%) across different models on both vision and NLP tasks.
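To make the top-k softmax idea and the sub-top-k aggregation concrete, the following minimal NumPy sketch approximates softmax over only the k largest attention scores and shows that merging per-crossbar local top-k candidates into a global top-k recovers the same result. This is a functional illustration only, not the in-memory-ADC circuit or the paper's exact algorithm; the function names, the vector size, and the number of sub-arrays are assumptions made for the example.

```python
import numpy as np

def topk_softmax(scores, k):
    """Softmax restricted to the k largest scores; all other positions get
    probability zero, mimicking the idea of forwarding only the k largest
    activations to the softmax block."""
    idx = np.argpartition(scores, -k)[-k:]          # indices of the k largest scores
    exp = np.exp(scores[idx] - scores[idx].max())   # subtract max for numerical stability
    probs = np.zeros_like(scores, dtype=float)
    probs[idx] = exp / exp.sum()
    return probs

def sub_topk_softmax(scores, k, num_subarrays):
    """Sub-top-k variant: the score vector is split across crossbar-sized
    sub-arrays, a local top-k is taken in each, and the local candidates are
    merged into a global top-k before softmax (sizes are illustrative)."""
    chunks = np.array_split(scores, num_subarrays)
    candidates, offset = [], 0
    for chunk in chunks:
        kk = min(k, len(chunk))
        local = np.argpartition(chunk, -kk)[-kk:]   # local top-k within one sub-array
        candidates.extend(local + offset)
        offset += len(chunk)
    candidates = np.array(candidates)
    global_idx = candidates[np.argsort(scores[candidates])[-k:]]  # global top-k among candidates
    exp = np.exp(scores[global_idx] - scores[global_idx].max())
    probs = np.zeros_like(scores, dtype=float)
    probs[global_idx] = exp / exp.sum()
    return probs

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    attn_scores = rng.normal(size=64)               # one row of attention logits
    full = np.exp(attn_scores - attn_scores.max())
    full /= full.sum()
    approx = topk_softmax(attn_scores, k=8)
    print("L1 error of top-8 softmax vs. full softmax:", np.abs(full - approx).sum())
    print("sub-top-k matches top-k:",
          np.allclose(approx, sub_topk_softmax(attn_scores, k=8, num_subarrays=4)))
```

Because every element of the global top-k necessarily lies in the top-k of its own sub-array, the merged result equals the direct top-k selection, which is why splitting the row across limited-size crossbars does not change the selected activations.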
Journal introduction:
TCAS I publishes regular papers in the field specified by the theory, analysis, design, and practical implementations of circuits, and the application of circuit techniques to systems and to signal processing. Included is the whole spectrum from basic scientific theory to industrial applications. The field of interest covered includes:
- Circuits: Analog, Digital and Mixed Signal Circuits and Systems
- Nonlinear Circuits and Systems, Integrated Sensors, MEMS and Systems on Chip, Nanoscale Circuits and Systems, Optoelectronic Circuits and Systems, Power Electronics and Systems
- Software for Analog-and-Logic Circuits and Systems
- Control aspects of Circuits and Systems