Topkima-Former: Low-Energy, Low-Latency Inference for Transformers Using Top-k In-Memory ADC

IF 5.2 · CAS Zone 1 (Engineering & Technology) · JCR Q1, ENGINEERING, ELECTRICAL & ELECTRONIC
Shuai Dong;Junyi Yang;Xiaoqi Peng;Hongyang Shang;Ye Ke;Xiaofeng Yang;Hongjie Liu;Arindam Basu
{"title":"Topkima-Former:基于Top-k内存ADC的变压器低能量、低延迟推理","authors":"Shuai Dong;Junyi Yang;Xiaoqi Peng;Hongyang Shang;Ye Ke;Xiaofeng Yang;Hongjie Liu;Arindam Basu","doi":"10.1109/TCSI.2025.3549060","DOIUrl":null,"url":null,"abstract":"Transformer has emerged as a leading architecture in neural language processing (NLP) and computer vision (CV). However, the extensive use of nonlinear operations, like softmax, poses a performance bottleneck during transformer inference and comprises up to 40% of the total latency. Hence, we propose innovations at the circuit, algorithm and architecture levels to accelerate the transformer. At the circuit level, we propose Topkima—combining top-<italic>k</i> activation selection with in-memory ADC (IMA) to implement efficient softmax without any sorting overhead. Only the <italic>k</i> largest activations are sent to softmax calculation block, reducing the huge computational cost of softmax. At the algorithmic level, a modified training scheme utilizes top-<italic>k</i> activations only during the forward pass, combined with a sub-top-<italic>k</i> method to address the crossbar size limitation by aggregating each sub-top-<italic>k</i> values as global top-<italic>k</i>. At the architecture level, we introduce a fine pipeline for efficiently scheduling data flows and an improved scale-free technique for removing scaling cost. The combined system, dubbed Topkima-Former, enhances <inline-formula> <tex-math>$1.8\\times -84\\times $ </tex-math></inline-formula> speedup and <inline-formula> <tex-math>$1.2\\times -36\\times $ </tex-math></inline-formula> energy efficiency (EE) over prior In-memory computing (IMC) accelerators. Compared to a conventional softmax macro and a digital top-<italic>k</i> (Dtopk) softmax macro, our proposed Topkima softmax macro achieves about <inline-formula> <tex-math>$15\\times $ </tex-math></inline-formula> and <inline-formula> <tex-math>$8\\times $ </tex-math></inline-formula> faster speed respectively. Experimental evaluations demonstrate minimal (0.42% to 1.60%) accuracy loss for different models in both vision and NLP tasks.","PeriodicalId":13039,"journal":{"name":"IEEE Transactions on Circuits and Systems I: Regular Papers","volume":"72 6","pages":"2509-2519"},"PeriodicalIF":5.2000,"publicationDate":"2025-03-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Topkima-Former: Low-Energy, Low-Latency Inference for Transformers Using Top-k In-Memory ADC\",\"authors\":\"Shuai Dong;Junyi Yang;Xiaoqi Peng;Hongyang Shang;Ye Ke;Xiaofeng Yang;Hongjie Liu;Arindam Basu\",\"doi\":\"10.1109/TCSI.2025.3549060\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Transformer has emerged as a leading architecture in neural language processing (NLP) and computer vision (CV). However, the extensive use of nonlinear operations, like softmax, poses a performance bottleneck during transformer inference and comprises up to 40% of the total latency. Hence, we propose innovations at the circuit, algorithm and architecture levels to accelerate the transformer. At the circuit level, we propose Topkima—combining top-<italic>k</i> activation selection with in-memory ADC (IMA) to implement efficient softmax without any sorting overhead. Only the <italic>k</i> largest activations are sent to softmax calculation block, reducing the huge computational cost of softmax. 
At the algorithmic level, a modified training scheme utilizes top-<italic>k</i> activations only during the forward pass, combined with a sub-top-<italic>k</i> method to address the crossbar size limitation by aggregating each sub-top-<italic>k</i> values as global top-<italic>k</i>. At the architecture level, we introduce a fine pipeline for efficiently scheduling data flows and an improved scale-free technique for removing scaling cost. The combined system, dubbed Topkima-Former, enhances <inline-formula> <tex-math>$1.8\\\\times -84\\\\times $ </tex-math></inline-formula> speedup and <inline-formula> <tex-math>$1.2\\\\times -36\\\\times $ </tex-math></inline-formula> energy efficiency (EE) over prior In-memory computing (IMC) accelerators. Compared to a conventional softmax macro and a digital top-<italic>k</i> (Dtopk) softmax macro, our proposed Topkima softmax macro achieves about <inline-formula> <tex-math>$15\\\\times $ </tex-math></inline-formula> and <inline-formula> <tex-math>$8\\\\times $ </tex-math></inline-formula> faster speed respectively. Experimental evaluations demonstrate minimal (0.42% to 1.60%) accuracy loss for different models in both vision and NLP tasks.\",\"PeriodicalId\":13039,\"journal\":{\"name\":\"IEEE Transactions on Circuits and Systems I: Regular Papers\",\"volume\":\"72 6\",\"pages\":\"2509-2519\"},\"PeriodicalIF\":5.2000,\"publicationDate\":\"2025-03-18\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Circuits and Systems I: Regular Papers\",\"FirstCategoryId\":\"5\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10931119/\",\"RegionNum\":1,\"RegionCategory\":\"工程技术\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"ENGINEERING, ELECTRICAL & ELECTRONIC\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Circuits and Systems I: Regular Papers","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/10931119/","RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
Citations: 0

Abstract

The transformer has emerged as a leading architecture in natural language processing (NLP) and computer vision (CV). However, the extensive use of nonlinear operations such as softmax poses a performance bottleneck during transformer inference, accounting for up to 40% of the total latency. Hence, we propose innovations at the circuit, algorithm, and architecture levels to accelerate the transformer. At the circuit level, we propose Topkima, which combines top-k activation selection with an in-memory ADC (IMA) to implement an efficient softmax without any sorting overhead. Only the k largest activations are sent to the softmax calculation block, greatly reducing the computational cost of softmax. At the algorithm level, a modified training scheme uses top-k activations only during the forward pass, combined with a sub-top-k method that addresses the crossbar size limitation by aggregating the sub-top-k values of each sub-array into a global top-k. At the architecture level, we introduce a fine-grained pipeline for efficiently scheduling data flows and an improved scale-free technique that removes the scaling cost. The combined system, dubbed Topkima-Former, achieves a 1.8×-84× speedup and 1.2×-36× higher energy efficiency (EE) over prior in-memory computing (IMC) accelerators. Compared to a conventional softmax macro and a digital top-k (Dtopk) softmax macro, our proposed Topkima softmax macro is about 15× and 8× faster, respectively. Experimental evaluations demonstrate minimal (0.42% to 1.60%) accuracy loss for different models on both vision and NLP tasks.
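As a rough software illustration of the ideas summarized in the abstract (this is not the authors' hardware design; the function names topk_softmax and sub_topk and the parameters k and sub_size are invented here for illustration), the following NumPy sketch shows how a softmax restricted to the k largest attention scores could be computed, and how per-sub-array top-k candidates could be merged into an approximate global top-k when a single crossbar cannot hold a whole score row.

import numpy as np

def topk_softmax(scores, k):
    """Approximate softmax that keeps only the k largest scores;
    all other positions receive probability 0 (illustrative sketch)."""
    idx = np.argpartition(scores, -k)[-k:]        # indices of the k largest values
    top = scores[idx]
    exp = np.exp(top - top.max())                 # numerically stable exponentials
    probs = np.zeros_like(scores, dtype=float)
    probs[idx] = exp / exp.sum()                  # normalize over the retained k values only
    return probs

def sub_topk(scores, k, sub_size):
    """Approximate a global top-k by taking a local top-k inside each
    sub-array (e.g., one per crossbar) and merging the candidates."""
    candidates = []
    for start in range(0, len(scores), sub_size):
        block = scores[start:start + sub_size]
        kk = min(k, len(block))
        local = np.argpartition(block, -kk)[-kk:] + start
        candidates.append(local)
    cand = np.concatenate(candidates)
    # final top-k is drawn only from the per-block winners
    return cand[np.argpartition(scores[cand], -k)[-k:]]

# Example: attention scores of one query over 16 keys, k = 4, two sub-arrays of width 8
rng = np.random.default_rng(0)
scores = rng.normal(size=16)
print(topk_softmax(scores, k=4))
print(sorted(sub_topk(scores, k=4, sub_size=8)))

Because softmax output is dominated by its largest inputs, dropping the small scores before the exponentials perturbs the distribution only slightly, which is consistent with the small (0.42% to 1.60%) accuracy loss reported in the abstract.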
Source journal
IEEE Transactions on Circuits and Systems I: Regular Papers
CiteScore: 9.80
Self-citation rate: 11.80%
Articles per year: 441
Review time: 2 months
Journal description: TCAS I publishes regular papers in the field specified by the theory, analysis, design, and practical implementations of circuits, and the application of circuit techniques to systems and to signal processing. Included is the whole spectrum from basic scientific theory to industrial applications. The field of interest covers:
- Circuits: Analog, Digital and Mixed Signal Circuits and Systems
- Nonlinear Circuits and Systems, Integrated Sensors, MEMS and Systems on Chip, Nanoscale Circuits and Systems, Optoelectronic Circuits and Systems, Power Electronics and Systems
- Software for Analog-and-Logic Circuits and Systems
- Control aspects of Circuits and Systems.