EdgeLLM: A Highly Efficient CPU-FPGA Heterogeneous Edge Accelerator for Large Language Models

Impact Factor 5.2 · JCR Q1, Engineering, Electrical & Electronic · CAS Tier 1 (Engineering & Technology)
Mingqiang Huang;Ao Shen;Kai Li;Haoxiang Peng;Boyu Li;Yupeng Su;Hao Yu
{"title":"EdgeLLM: A Highly Efficient CPU-FPGA Heterogeneous Edge Accelerator for Large Language Models","authors":"Mingqiang Huang;Ao Shen;Kai Li;Haoxiang Peng;Boyu Li;Yupeng Su;Hao Yu","doi":"10.1109/TCSI.2025.3546256","DOIUrl":null,"url":null,"abstract":"The rapid advancements in artificial intelligence (AI), particularly the Large Language Models (LLMs), have profoundly affected our daily work and communication forms. However, it is still a challenge to deploy LLMs on resource-constrained edge devices (such as robots), due to the intensive computation requirements, heavy memory access, diverse operator types and difficulties in compilation. In this work, we proposed EdgeLLM to address the above issues. Firstly, focusing on the computation, we designed mix-precision processing element array together with group systolic architecture, that can efficiently support both FP<inline-formula> <tex-math>$16\\ast $ </tex-math></inline-formula>FP16 for the MHA block (Multi-Head Attention) and FP<inline-formula> <tex-math>$16\\ast $ </tex-math></inline-formula>INT4 for the FFN layer (Feed-Forward Network). Meanwhile specific optimization on log-scale structured weight sparsity, has been used to further increase the efficiency. Secondly, to address the compilation and deployment issue, we analyzed the whole operators within LLM models and developed a universal data parallelism scheme, by which all of the input and output features maintain the same data shape, enabling to process different operators without any data rearrangement. Then we proposed an end-to-end compiler to map the whole LLM model on CPU-FPGA heterogeneous system (AMD Xilinx VCU128 FPGA). The accelerator achieves <inline-formula> <tex-math>$1.91\\times $ </tex-math></inline-formula> higher throughput and <inline-formula> <tex-math>$7.55\\times $ </tex-math></inline-formula> higher energy efficiency than the commercial GPU (NVIDIA A100-SXM4-80G). When compared with state-of-the-art FPGA accelerator of FlightLLM, it shows 10-24% better performance in terms of HBM bandwidth utilization, energy efficiency and LLM throughput.","PeriodicalId":13039,"journal":{"name":"IEEE Transactions on Circuits and Systems I: Regular Papers","volume":"72 7","pages":"3352-3365"},"PeriodicalIF":5.2000,"publicationDate":"2025-03-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Circuits and Systems I: Regular Papers","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/10916480/","RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
Citations: 0

Abstract

The rapid advancement of artificial intelligence (AI), particularly Large Language Models (LLMs), has profoundly affected our daily work and forms of communication. However, deploying LLMs on resource-constrained edge devices (such as robots) remains a challenge due to intensive computation requirements, heavy memory access, diverse operator types, and difficulties in compilation. In this work, we propose EdgeLLM to address these issues. First, focusing on computation, we design a mixed-precision processing element array together with a group systolic architecture that can efficiently support both FP16×FP16 for the MHA (Multi-Head Attention) block and FP16×INT4 for the FFN (Feed-Forward Network) layers. Meanwhile, a specific optimization for log-scale structured weight sparsity is used to further increase efficiency. Second, to address compilation and deployment, we analyze all operators within LLM models and develop a universal data-parallelism scheme in which all input and output features maintain the same data shape, enabling different operators to be processed without any data rearrangement. We then propose an end-to-end compiler to map the whole LLM model onto a CPU-FPGA heterogeneous system (AMD Xilinx VCU128 FPGA). The accelerator achieves 1.91× higher throughput and 7.55× higher energy efficiency than a commercial GPU (NVIDIA A100-SXM4-80G). Compared with the state-of-the-art FPGA accelerator FlightLLM, it shows 10-24% better performance in terms of HBM bandwidth utilization, energy efficiency, and LLM throughput.
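To make the two datapath ideas in the abstract concrete, the sketch below emulates them in software: an FP16×FP16 matrix multiply for the attention path, and an FP16×INT4 multiply for the FFN path in which the 4-bit weights are log-scale (sign plus power-of-two magnitude), so each "multiplication" reduces to a sign flip and an exponent shift. This is only an illustration under assumptions, not the paper's hardware or quantization scheme; all function names and the 4-bit sign-plus-exponent packing are my own.

```python
# Minimal NumPy sketch (assumptions, not the paper's implementation) of the
# two GEMM precisions described in the abstract.
import numpy as np

def gemm_fp16(a, b):
    """MHA path: plain FP16 x FP16 matrix multiply, accumulated in FP32."""
    return (a.astype(np.float32) @ b.astype(np.float32)).astype(np.float16)

def quantize_log_int4(w_fp16):
    """Toy log-scale INT4 quantizer (assumed format): keep the sign and round
    |w| to the nearest power of two, i.e. store (sign, exponent)."""
    sign = np.sign(w_fp16).astype(np.int8)
    mag = np.abs(w_fp16.astype(np.float32))
    exp = np.clip(np.round(np.log2(np.maximum(mag, 1e-8))), -8, -1).astype(np.int8)
    return sign, exp

def gemm_fp16_logint4(a_fp16, sign, exp):
    """FFN path: FP16 activations times log-scale INT4 weights.
    Each weight is +/- 2^e, so the multiply is just a scale by 2^e, which
    hardware can realize as an exponent add (shift) rather than a full multiply."""
    w_eff = sign.astype(np.float32) * np.exp2(exp.astype(np.float32))
    return (a_fp16.astype(np.float32) @ w_eff).astype(np.float16)

# Tiny usage example comparing the full-precision and log-quantized results.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8)).astype(np.float16)
w = (0.1 * rng.standard_normal((8, 8))).astype(np.float16)
s, e = quantize_log_int4(w)
print(gemm_fp16(x, w)[0, :4])
print(gemm_fp16_logint4(x, s, e)[0, :4])
```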
Source journal
IEEE Transactions on Circuits and Systems I: Regular Papers (Engineering, Electrical & Electronic)
CiteScore: 9.80
Self-citation rate: 11.80%
Articles published: 441
Review time: 2 months
Journal description: TCAS I publishes regular papers in the field specified by the theory, analysis, design, and practical implementations of circuits, and the application of circuit techniques to systems and to signal processing. Coverage spans the whole spectrum from basic scientific theory to industrial applications. The field of interest includes: analog, digital, and mixed-signal circuits and systems; nonlinear circuits and systems; integrated sensors; MEMS and systems on chip; nanoscale circuits and systems; optoelectronic circuits and systems; power electronics and systems; software for analog-and-logic circuits and systems; and control aspects of circuits and systems.