Andrea Belano;Yvan Tortorella;Angelo Garofalo;Luca Benini;Davide Rossi;Francesco Conti
{"title":"基于高精度加速Softmax和GELU的边缘生成人工智能灵活模板","authors":"Andrea Belano;Yvan Tortorella;Angelo Garofalo;Luca Benini;Davide Rossi;Francesco Conti","doi":"10.1109/JETCAS.2025.3562734","DOIUrl":null,"url":null,"abstract":"Transformer-based generative Artificial Intelligence (GenAI) models achieve remarkable results in a wide range of fields, including natural language processing, computer vision, and audio processing. However, this comes at the cost of increased complexity and the need of sophisticated non-linearities such as softmax and GELU. Even if Transformers are computationally dominated by matrix multiplications (MatMul), these non-linearities can become a performance bottleneck, especially if dedicated hardware is used to accelerate MatMul operators. In this work, we introduce a GenAI BFloat16 Transformer acceleration template based on a heterogeneous tightly-coupled cluster containing 256KiB of shared SRAM, 8 general-purpose RISC-V cores, a <inline-formula> <tex-math>$24\\times 8$ </tex-math></inline-formula> systolic array MatMul accelerator, and a novel accelerator for Transformer softmax, GELU and SiLU non-linearities: SoftEx. SoftEx introduces an approximate exponentiation algorithm balancing efficiency (<inline-formula> <tex-math>$121\\times $ </tex-math></inline-formula> speedup over glibc’s implementation) with accuracy (mean relative error of 0.14%). In 12 nm technology, SoftEx occupies 0.039 mm<sup>2</sup>, only 3.22% of the cluster, which achieves an operating frequency of 1.12 GHz. Compared to optimized software running on the RISC-V cores, SoftEx achieves significant improvements, accelerating softmax and GELU computations by up to <inline-formula> <tex-math>$10.8\\times $ </tex-math></inline-formula> and <inline-formula> <tex-math>$5.11\\times $ </tex-math></inline-formula>, respectively, while reducing their energy consumption by up to <inline-formula> <tex-math>$10.8\\times $ </tex-math></inline-formula> and <inline-formula> <tex-math>$5.29\\times $ </tex-math></inline-formula>. These enhancements translate into a <inline-formula> <tex-math>$1.58\\times $ </tex-math></inline-formula> increase in throughput (310 GOPS at 0.8 V) and a <inline-formula> <tex-math>$1.42\\times $ </tex-math></inline-formula> improvement in energy efficiency (1.34 TOPS/W at 0.55 V) on end-to-end ViT inference workloads.","PeriodicalId":48827,"journal":{"name":"IEEE Journal on Emerging and Selected Topics in Circuits and Systems","volume":"15 2","pages":"200-216"},"PeriodicalIF":3.7000,"publicationDate":"2025-04-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"A Flexible Template for Edge Generative AI With High-Accuracy Accelerated Softmax and GELU\",\"authors\":\"Andrea Belano;Yvan Tortorella;Angelo Garofalo;Luca Benini;Davide Rossi;Francesco Conti\",\"doi\":\"10.1109/JETCAS.2025.3562734\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Transformer-based generative Artificial Intelligence (GenAI) models achieve remarkable results in a wide range of fields, including natural language processing, computer vision, and audio processing. However, this comes at the cost of increased complexity and the need of sophisticated non-linearities such as softmax and GELU. Even if Transformers are computationally dominated by matrix multiplications (MatMul), these non-linearities can become a performance bottleneck, especially if dedicated hardware is used to accelerate MatMul operators. 
In this work, we introduce a GenAI BFloat16 Transformer acceleration template based on a heterogeneous tightly-coupled cluster containing 256KiB of shared SRAM, 8 general-purpose RISC-V cores, a <inline-formula> <tex-math>$24\\\\times 8$ </tex-math></inline-formula> systolic array MatMul accelerator, and a novel accelerator for Transformer softmax, GELU and SiLU non-linearities: SoftEx. SoftEx introduces an approximate exponentiation algorithm balancing efficiency (<inline-formula> <tex-math>$121\\\\times $ </tex-math></inline-formula> speedup over glibc’s implementation) with accuracy (mean relative error of 0.14%). In 12 nm technology, SoftEx occupies 0.039 mm<sup>2</sup>, only 3.22% of the cluster, which achieves an operating frequency of 1.12 GHz. Compared to optimized software running on the RISC-V cores, SoftEx achieves significant improvements, accelerating softmax and GELU computations by up to <inline-formula> <tex-math>$10.8\\\\times $ </tex-math></inline-formula> and <inline-formula> <tex-math>$5.11\\\\times $ </tex-math></inline-formula>, respectively, while reducing their energy consumption by up to <inline-formula> <tex-math>$10.8\\\\times $ </tex-math></inline-formula> and <inline-formula> <tex-math>$5.29\\\\times $ </tex-math></inline-formula>. These enhancements translate into a <inline-formula> <tex-math>$1.58\\\\times $ </tex-math></inline-formula> increase in throughput (310 GOPS at 0.8 V) and a <inline-formula> <tex-math>$1.42\\\\times $ </tex-math></inline-formula> improvement in energy efficiency (1.34 TOPS/W at 0.55 V) on end-to-end ViT inference workloads.\",\"PeriodicalId\":48827,\"journal\":{\"name\":\"IEEE Journal on Emerging and Selected Topics in Circuits and Systems\",\"volume\":\"15 2\",\"pages\":\"200-216\"},\"PeriodicalIF\":3.7000,\"publicationDate\":\"2025-04-21\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Journal on Emerging and Selected Topics in Circuits and Systems\",\"FirstCategoryId\":\"5\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10971415/\",\"RegionNum\":2,\"RegionCategory\":\"工程技术\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"ENGINEERING, ELECTRICAL & ELECTRONIC\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Journal on Emerging and Selected Topics in Circuits and Systems","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/10971415/","RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
A Flexible Template for Edge Generative AI With High-Accuracy Accelerated Softmax and GELU
Transformer-based generative Artificial Intelligence (GenAI) models achieve remarkable results in a wide range of fields, including natural language processing, computer vision, and audio processing. However, this comes at the cost of increased complexity and the need for sophisticated non-linearities such as softmax and GELU. Even though Transformers are computationally dominated by matrix multiplications (MatMul), these non-linearities can become a performance bottleneck, especially when dedicated hardware is used to accelerate the MatMul operators. In this work, we introduce a GenAI BFloat16 Transformer acceleration template based on a heterogeneous tightly-coupled cluster containing 256 KiB of shared SRAM, 8 general-purpose RISC-V cores, a $24\times 8$ systolic-array MatMul accelerator, and SoftEx, a novel accelerator for the Transformer softmax, GELU, and SiLU non-linearities. SoftEx introduces an approximate exponentiation algorithm that balances efficiency ($121\times$ speedup over glibc's implementation) with accuracy (mean relative error of 0.14%). In 12 nm technology, SoftEx occupies 0.039 mm², only 3.22% of the cluster area; the full cluster achieves an operating frequency of 1.12 GHz. Compared to optimized software running on the RISC-V cores, SoftEx accelerates softmax and GELU computations by up to $10.8\times$ and $5.11\times$, respectively, while reducing their energy consumption by up to $10.8\times$ and $5.29\times$. These enhancements translate into a $1.58\times$ increase in throughput (310 GOPS at 0.8 V) and a $1.42\times$ improvement in energy efficiency (1.34 TOPS/W at 0.55 V) on end-to-end ViT inference workloads.
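For intuition about the kind of bit-level exponential approximation the abstract alludes to, the Python sketch below implements a classic Schraudolph-style fast exp and plugs it into a numerically stable softmax. This is a generic illustration only, not SoftEx's published algorithm: it does not use a BFloat16 datapath and does not reach the paper's reported 0.14% mean relative error, and the helper names `fast_exp` and `softmax_fast` are hypothetical.

```python
import numpy as np

def fast_exp(x):
    # Schraudolph-style exponential: write round(x * 2^23 / ln 2) plus the
    # float32 exponent bias into an int32, then reinterpret the bits as a
    # float32, giving roughly e^x in a handful of integer operations.
    # The raw bias used here overestimates e^x by up to ~6%; Schraudolph
    # (1999) subtracts a small correction constant from b to center the
    # error band. SoftEx's algorithm is more accurate than this sketch.
    x = np.asarray(x, dtype=np.float32)
    a = np.float32(2 ** 23 / np.log(2))   # maps x into exponent-field units
    b = np.float32(127 * 2 ** 23)         # float32 exponent bias << 23
    i = np.rint(a * x + b).astype(np.int32)
    return i.view(np.float32)

def softmax_fast(x):
    # Numerically stable softmax on top of the approximate exponential:
    # subtracting the max keeps every argument <= 0, and the clip guards
    # against the bit trick underflowing into a negative int32 pattern.
    x = np.asarray(x, dtype=np.float32)
    z = np.clip(x - x.max(), -80.0, 0.0)
    e = fast_exp(z)
    return e / e.sum()
```

For example, `softmax_fast(np.array([1.0, 2.0, 3.0], dtype=np.float32))` stays within a few percent of an `np.exp`-based softmax, since the exponential's relative errors partially cancel in the normalization. The max-subtraction step is the same stabilization that a hardware softmax unit typically needs, which is one reason streaming accelerators generally perform a reduction pass over the input before exponentiating.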
Journal Introduction:
The IEEE Journal on Emerging and Selected Topics in Circuits and Systems is published quarterly and solicits special issues, with particular emphasis on emerging areas, on topics covering the entire scope of the IEEE Circuits and Systems (CAS) Society: the theory, analysis, design, tools, and implementation of circuits and systems, spanning their theoretical foundations, applications, and architectures for signal and information processing.