Andrea Belano;Yvan Tortorella;Angelo Garofalo;Luca Benini;Davide Rossi;Francesco Conti
{"title":"基于高精度加速Softmax和GELU的边缘生成人工智能灵活模板","authors":"Andrea Belano;Yvan Tortorella;Angelo Garofalo;Luca Benini;Davide Rossi;Francesco Conti","doi":"10.1109/JETCAS.2025.3562734","DOIUrl":null,"url":null,"abstract":"Transformer-based generative Artificial Intelligence (GenAI) models achieve remarkable results in a wide range of fields, including natural language processing, computer vision, and audio processing. However, this comes at the cost of increased complexity and the need of sophisticated non-linearities such as softmax and GELU. Even if Transformers are computationally dominated by matrix multiplications (MatMul), these non-linearities can become a performance bottleneck, especially if dedicated hardware is used to accelerate MatMul operators. In this work, we introduce a GenAI BFloat16 Transformer acceleration template based on a heterogeneous tightly-coupled cluster containing 256KiB of shared SRAM, 8 general-purpose RISC-V cores, a <inline-formula> <tex-math>$24\\times 8$ </tex-math></inline-formula> systolic array MatMul accelerator, and a novel accelerator for Transformer softmax, GELU and SiLU non-linearities: SoftEx. SoftEx introduces an approximate exponentiation algorithm balancing efficiency (<inline-formula> <tex-math>$121\\times $ </tex-math></inline-formula> speedup over glibc’s implementation) with accuracy (mean relative error of 0.14%). In 12 nm technology, SoftEx occupies 0.039 mm<sup>2</sup>, only 3.22% of the cluster, which achieves an operating frequency of 1.12 GHz. Compared to optimized software running on the RISC-V cores, SoftEx achieves significant improvements, accelerating softmax and GELU computations by up to <inline-formula> <tex-math>$10.8\\times $ </tex-math></inline-formula> and <inline-formula> <tex-math>$5.11\\times $ </tex-math></inline-formula>, respectively, while reducing their energy consumption by up to <inline-formula> <tex-math>$10.8\\times $ </tex-math></inline-formula> and <inline-formula> <tex-math>$5.29\\times $ </tex-math></inline-formula>. These enhancements translate into a <inline-formula> <tex-math>$1.58\\times $ </tex-math></inline-formula> increase in throughput (310 GOPS at 0.8 V) and a <inline-formula> <tex-math>$1.42\\times $ </tex-math></inline-formula> improvement in energy efficiency (1.34 TOPS/W at 0.55 V) on end-to-end ViT inference workloads.","PeriodicalId":48827,"journal":{"name":"IEEE Journal on Emerging and Selected Topics in Circuits and Systems","volume":"15 2","pages":"200-216"},"PeriodicalIF":3.7000,"publicationDate":"2025-04-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"A Flexible Template for Edge Generative AI With High-Accuracy Accelerated Softmax and GELU\",\"authors\":\"Andrea Belano;Yvan Tortorella;Angelo Garofalo;Luca Benini;Davide Rossi;Francesco Conti\",\"doi\":\"10.1109/JETCAS.2025.3562734\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Transformer-based generative Artificial Intelligence (GenAI) models achieve remarkable results in a wide range of fields, including natural language processing, computer vision, and audio processing. However, this comes at the cost of increased complexity and the need of sophisticated non-linearities such as softmax and GELU. Even if Transformers are computationally dominated by matrix multiplications (MatMul), these non-linearities can become a performance bottleneck, especially if dedicated hardware is used to accelerate MatMul operators. 
In this work, we introduce a GenAI BFloat16 Transformer acceleration template based on a heterogeneous tightly-coupled cluster containing 256KiB of shared SRAM, 8 general-purpose RISC-V cores, a <inline-formula> <tex-math>$24\\\\times 8$ </tex-math></inline-formula> systolic array MatMul accelerator, and a novel accelerator for Transformer softmax, GELU and SiLU non-linearities: SoftEx. SoftEx introduces an approximate exponentiation algorithm balancing efficiency (<inline-formula> <tex-math>$121\\\\times $ </tex-math></inline-formula> speedup over glibc’s implementation) with accuracy (mean relative error of 0.14%). In 12 nm technology, SoftEx occupies 0.039 mm<sup>2</sup>, only 3.22% of the cluster, which achieves an operating frequency of 1.12 GHz. Compared to optimized software running on the RISC-V cores, SoftEx achieves significant improvements, accelerating softmax and GELU computations by up to <inline-formula> <tex-math>$10.8\\\\times $ </tex-math></inline-formula> and <inline-formula> <tex-math>$5.11\\\\times $ </tex-math></inline-formula>, respectively, while reducing their energy consumption by up to <inline-formula> <tex-math>$10.8\\\\times $ </tex-math></inline-formula> and <inline-formula> <tex-math>$5.29\\\\times $ </tex-math></inline-formula>. These enhancements translate into a <inline-formula> <tex-math>$1.58\\\\times $ </tex-math></inline-formula> increase in throughput (310 GOPS at 0.8 V) and a <inline-formula> <tex-math>$1.42\\\\times $ </tex-math></inline-formula> improvement in energy efficiency (1.34 TOPS/W at 0.55 V) on end-to-end ViT inference workloads.\",\"PeriodicalId\":48827,\"journal\":{\"name\":\"IEEE Journal on Emerging and Selected Topics in Circuits and Systems\",\"volume\":\"15 2\",\"pages\":\"200-216\"},\"PeriodicalIF\":3.7000,\"publicationDate\":\"2025-04-21\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Journal on Emerging and Selected Topics in Circuits and Systems\",\"FirstCategoryId\":\"5\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10971415/\",\"RegionNum\":2,\"RegionCategory\":\"工程技术\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"ENGINEERING, ELECTRICAL & ELECTRONIC\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Journal on Emerging and Selected Topics in Circuits and Systems","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/10971415/","RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
A Flexible Template for Edge Generative AI With High-Accuracy Accelerated Softmax and GELU
Transformer-based generative Artificial Intelligence (GenAI) models achieve remarkable results in a wide range of fields, including natural language processing, computer vision, and audio processing. However, this comes at the cost of increased complexity and the need for sophisticated non-linearities such as softmax and GELU. Even though Transformers are computationally dominated by matrix multiplications (MatMul), these non-linearities can become a performance bottleneck, especially when dedicated hardware is used to accelerate the MatMul operators. In this work, we introduce a GenAI BFloat16 Transformer acceleration template based on a heterogeneous tightly-coupled cluster containing 256 KiB of shared SRAM, 8 general-purpose RISC-V cores, a $24\times 8$ systolic-array MatMul accelerator, and SoftEx, a novel accelerator for the Transformer softmax, GELU, and SiLU non-linearities. SoftEx introduces an approximate exponentiation algorithm that balances efficiency ($121\times$ speedup over glibc's implementation) with accuracy (mean relative error of 0.14%). In 12 nm technology, SoftEx occupies 0.039 mm², only 3.22% of the cluster area; the full cluster achieves an operating frequency of 1.12 GHz. Compared to optimized software running on the RISC-V cores, SoftEx accelerates softmax and GELU computations by up to $10.8\times$ and $5.11\times$, respectively, while reducing their energy consumption by up to $10.8\times$ and $5.29\times$. These enhancements translate into a $1.58\times$ increase in throughput (310 GOPS at 0.8 V) and a $1.42\times$ improvement in energy efficiency (1.34 TOPS/W at 0.55 V) on end-to-end ViT inference workloads.
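For intuition about the kind of bit-level exponential approximation the abstract alludes to, the Python sketch below implements a classic Schraudolph-style fast exp and plugs it into a numerically stable softmax. This is a generic illustration only, not SoftEx's published algorithm: it does not use a BFloat16 datapath and does not reach the paper's reported 0.14% mean relative error, and the helper names `fast_exp` and `softmax_fast` are hypothetical.

```python
import numpy as np

def fast_exp(x):
    # Schraudolph-style exponential: write round(x * 2^23 / ln 2) plus the
    # float32 exponent bias into an int32, then reinterpret the bits as a
    # float32, giving roughly e^x in a handful of integer operations.
    # The raw bias used here overestimates e^x by up to ~6%; Schraudolph
    # (1999) subtracts a small correction constant from b to center the
    # error band. SoftEx's algorithm is more accurate than this sketch.
    x = np.asarray(x, dtype=np.float32)
    a = np.float32(2 ** 23 / np.log(2))   # maps x into exponent-field units
    b = np.float32(127 * 2 ** 23)         # float32 exponent bias << 23
    i = np.rint(a * x + b).astype(np.int32)
    return i.view(np.float32)

def softmax_fast(x):
    # Numerically stable softmax on top of the approximate exponential:
    # subtracting the max keeps every argument <= 0, and the clip guards
    # against the bit trick underflowing into a negative int32 pattern.
    x = np.asarray(x, dtype=np.float32)
    z = np.clip(x - x.max(), -80.0, 0.0)
    e = fast_exp(z)
    return e / e.sum()
```

For example, `softmax_fast(np.array([1.0, 2.0, 3.0], dtype=np.float32))` stays within a few percent of an `np.exp`-based softmax, since the exponential's relative errors partially cancel in the normalization. The max-subtraction step is the same stabilization that a hardware softmax unit typically needs, which is one reason streaming accelerators generally perform a reduction pass over the input before exponentiating.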
Journal Introduction:
The IEEE Journal on Emerging and Selected Topics in Circuits and Systems is published quarterly and solicits special issues, with particular emphasis on emerging areas, on topics covering the entire scope of the IEEE Circuits and Systems (CAS) Society: the theory, analysis, design, tools, and implementation of circuits and systems, spanning their theoretical foundations, applications, and architectures for signal and information processing.