A Flexible Template for Edge Generative AI With High-Accuracy Accelerated Softmax and GELU

IF 3.7 · CAS Tier 2 (Engineering & Technology) · JCR Q2, ENGINEERING, ELECTRICAL & ELECTRONIC
Andrea Belano;Yvan Tortorella;Angelo Garofalo;Luca Benini;Davide Rossi;Francesco Conti
{"title":"基于高精度加速Softmax和GELU的边缘生成人工智能灵活模板","authors":"Andrea Belano;Yvan Tortorella;Angelo Garofalo;Luca Benini;Davide Rossi;Francesco Conti","doi":"10.1109/JETCAS.2025.3562734","DOIUrl":null,"url":null,"abstract":"Transformer-based generative Artificial Intelligence (GenAI) models achieve remarkable results in a wide range of fields, including natural language processing, computer vision, and audio processing. However, this comes at the cost of increased complexity and the need of sophisticated non-linearities such as softmax and GELU. Even if Transformers are computationally dominated by matrix multiplications (MatMul), these non-linearities can become a performance bottleneck, especially if dedicated hardware is used to accelerate MatMul operators. In this work, we introduce a GenAI BFloat16 Transformer acceleration template based on a heterogeneous tightly-coupled cluster containing 256KiB of shared SRAM, 8 general-purpose RISC-V cores, a <inline-formula> <tex-math>$24\\times 8$ </tex-math></inline-formula> systolic array MatMul accelerator, and a novel accelerator for Transformer softmax, GELU and SiLU non-linearities: SoftEx. SoftEx introduces an approximate exponentiation algorithm balancing efficiency (<inline-formula> <tex-math>$121\\times $ </tex-math></inline-formula> speedup over glibc’s implementation) with accuracy (mean relative error of 0.14%). In 12 nm technology, SoftEx occupies 0.039 mm<sup>2</sup>, only 3.22% of the cluster, which achieves an operating frequency of 1.12 GHz. Compared to optimized software running on the RISC-V cores, SoftEx achieves significant improvements, accelerating softmax and GELU computations by up to <inline-formula> <tex-math>$10.8\\times $ </tex-math></inline-formula> and <inline-formula> <tex-math>$5.11\\times $ </tex-math></inline-formula>, respectively, while reducing their energy consumption by up to <inline-formula> <tex-math>$10.8\\times $ </tex-math></inline-formula> and <inline-formula> <tex-math>$5.29\\times $ </tex-math></inline-formula>. These enhancements translate into a <inline-formula> <tex-math>$1.58\\times $ </tex-math></inline-formula> increase in throughput (310 GOPS at 0.8 V) and a <inline-formula> <tex-math>$1.42\\times $ </tex-math></inline-formula> improvement in energy efficiency (1.34 TOPS/W at 0.55 V) on end-to-end ViT inference workloads.","PeriodicalId":48827,"journal":{"name":"IEEE Journal on Emerging and Selected Topics in Circuits and Systems","volume":"15 2","pages":"200-216"},"PeriodicalIF":3.7000,"publicationDate":"2025-04-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"A Flexible Template for Edge Generative AI With High-Accuracy Accelerated Softmax and GELU\",\"authors\":\"Andrea Belano;Yvan Tortorella;Angelo Garofalo;Luca Benini;Davide Rossi;Francesco Conti\",\"doi\":\"10.1109/JETCAS.2025.3562734\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Transformer-based generative Artificial Intelligence (GenAI) models achieve remarkable results in a wide range of fields, including natural language processing, computer vision, and audio processing. However, this comes at the cost of increased complexity and the need of sophisticated non-linearities such as softmax and GELU. Even if Transformers are computationally dominated by matrix multiplications (MatMul), these non-linearities can become a performance bottleneck, especially if dedicated hardware is used to accelerate MatMul operators. 
In this work, we introduce a GenAI BFloat16 Transformer acceleration template based on a heterogeneous tightly-coupled cluster containing 256KiB of shared SRAM, 8 general-purpose RISC-V cores, a <inline-formula> <tex-math>$24\\\\times 8$ </tex-math></inline-formula> systolic array MatMul accelerator, and a novel accelerator for Transformer softmax, GELU and SiLU non-linearities: SoftEx. SoftEx introduces an approximate exponentiation algorithm balancing efficiency (<inline-formula> <tex-math>$121\\\\times $ </tex-math></inline-formula> speedup over glibc’s implementation) with accuracy (mean relative error of 0.14%). In 12 nm technology, SoftEx occupies 0.039 mm<sup>2</sup>, only 3.22% of the cluster, which achieves an operating frequency of 1.12 GHz. Compared to optimized software running on the RISC-V cores, SoftEx achieves significant improvements, accelerating softmax and GELU computations by up to <inline-formula> <tex-math>$10.8\\\\times $ </tex-math></inline-formula> and <inline-formula> <tex-math>$5.11\\\\times $ </tex-math></inline-formula>, respectively, while reducing their energy consumption by up to <inline-formula> <tex-math>$10.8\\\\times $ </tex-math></inline-formula> and <inline-formula> <tex-math>$5.29\\\\times $ </tex-math></inline-formula>. These enhancements translate into a <inline-formula> <tex-math>$1.58\\\\times $ </tex-math></inline-formula> increase in throughput (310 GOPS at 0.8 V) and a <inline-formula> <tex-math>$1.42\\\\times $ </tex-math></inline-formula> improvement in energy efficiency (1.34 TOPS/W at 0.55 V) on end-to-end ViT inference workloads.\",\"PeriodicalId\":48827,\"journal\":{\"name\":\"IEEE Journal on Emerging and Selected Topics in Circuits and Systems\",\"volume\":\"15 2\",\"pages\":\"200-216\"},\"PeriodicalIF\":3.7000,\"publicationDate\":\"2025-04-21\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Journal on Emerging and Selected Topics in Circuits and Systems\",\"FirstCategoryId\":\"5\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10971415/\",\"RegionNum\":2,\"RegionCategory\":\"工程技术\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"ENGINEERING, ELECTRICAL & ELECTRONIC\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Journal on Emerging and Selected Topics in Circuits and Systems","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/10971415/","RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
Citations: 0

Abstract

Transformer-based generative Artificial Intelligence (GenAI) models achieve remarkable results in a wide range of fields, including natural language processing, computer vision, and audio processing. However, this comes at the cost of increased complexity and the need for sophisticated non-linearities such as softmax and GELU. Even though Transformers are computationally dominated by matrix multiplications (MatMul), these non-linearities can become a performance bottleneck, especially when dedicated hardware is used to accelerate the MatMul operators. In this work, we introduce a GenAI BFloat16 Transformer acceleration template based on a heterogeneous tightly coupled cluster containing 256 KiB of shared SRAM, 8 general-purpose RISC-V cores, a 24×8 systolic-array MatMul accelerator, and SoftEx, a novel accelerator for the Transformer softmax, GELU, and SiLU non-linearities. SoftEx introduces an approximate exponentiation algorithm that balances efficiency (121× speedup over glibc's implementation) with accuracy (mean relative error of 0.14%). In 12 nm technology, SoftEx occupies 0.039 mm², only 3.22% of the cluster, which achieves an operating frequency of 1.12 GHz. Compared to optimized software running on the RISC-V cores, SoftEx accelerates softmax and GELU computations by up to 10.8× and 5.11×, respectively, while reducing their energy consumption by up to 10.8× and 5.29×. These enhancements translate into a 1.58× increase in throughput (310 GOPS at 0.8 V) and a 1.42× improvement in energy efficiency (1.34 TOPS/W at 0.55 V) on end-to-end ViT inference workloads.
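The abstract credits SoftEx's gains to an approximate exponentiation algorithm but does not disclose it. For intuition only, the NumPy sketch below pairs a classic Schraudolph-style (1999) bit-manipulation approximation of exp() with the softmax, GELU, and SiLU non-linearities the accelerator targets. The Schraudolph constants, the float32 datatype (the paper uses BFloat16), and the exp-based rewrites of tanh and sigmoid are illustrative assumptions, not the published SoftEx design.

```python
# Illustrative sketch, NOT the SoftEx algorithm: the abstract does not
# disclose the exponentiation scheme, so a Schraudolph-style bit-trick
# approximation of exp() stands in here, in float32 rather than BFloat16.
import numpy as np

def approx_exp(x: np.ndarray) -> np.ndarray:
    """Approximate exp(x) by writing x/ln(2) directly into the float32 bits."""
    x = np.clip(np.asarray(x, dtype=np.float64), -87.0, 88.0)  # keep bits valid
    # i = a*x + (bias - c): a = 2^23/ln(2), bias = 127<<23, c tuned for mean error
    i = (12102203.0 * x + 1064866805.0).astype(np.int32)
    return i.view(np.float32).astype(np.float64)

def softmax(x: np.ndarray) -> np.ndarray:
    """Numerically stable softmax built on the approximate exp."""
    z = x - x.max(axis=-1, keepdims=True)   # max-subtraction: all inputs <= 0
    e = approx_exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def silu(x: np.ndarray) -> np.ndarray:
    """SiLU(x) = x * sigmoid(x), with sigmoid expressed through exp."""
    return x / (1.0 + approx_exp(-x))

def gelu(x: np.ndarray) -> np.ndarray:
    """Tanh-form GELU, with tanh rewritten via exp so one exp unit serves all."""
    y = np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)
    t = 1.0 - 2.0 / (approx_exp(2.0 * y) + 1.0)  # tanh(y) = 1 - 2/(e^(2y) + 1)
    return 0.5 * x * (1.0 + t)

if __name__ == "__main__":
    x = np.linspace(-4.0, 4.0, 9)
    err = np.abs(approx_exp(x) - np.exp(x)) / np.exp(x)
    print("max relative exp error:", err.max())  # a few percent for this crude
                                                 # variant; SoftEx reports 0.14%
    print("softmax sums to:", softmax(x).sum())  # ~1.0
```

A real BFloat16 datapath would refine the fractional-part correction (for example with a small polynomial or lookup table) to reach sub-percent error figures such as the 0.14% mean relative error reported for SoftEx.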
Source journal: IEEE Journal on Emerging and Selected Topics in Circuits and Systems
CiteScore: 8.50 · Self-citation rate: 2.20% · Annual article count: 86
Journal description: The IEEE Journal on Emerging and Selected Topics in Circuits and Systems is published quarterly and solicits, with particular emphasis on emerging areas, special issues on topics that cover the entire scope of the IEEE Circuits and Systems (CAS) Society, namely the theory, analysis, design, tools, and implementation of circuits and systems, spanning their theoretical foundations, applications, and architectures for signal and information processing.