{"title":"EdgeLLM: A Highly Efficient CPU-FPGA Heterogeneous Edge Accelerator for Large Language Models","authors":"Mingqiang Huang;Ao Shen;Kai Li;Haoxiang Peng;Boyu Li;Yupeng Su;Hao Yu","doi":"10.1109/TCSI.2025.3546256","DOIUrl":null,"url":null,"abstract":"The rapid advancements in artificial intelligence (AI), particularly the Large Language Models (LLMs), have profoundly affected our daily work and communication forms. However, it is still a challenge to deploy LLMs on resource-constrained edge devices (such as robots), due to the intensive computation requirements, heavy memory access, diverse operator types and difficulties in compilation. In this work, we proposed EdgeLLM to address the above issues. Firstly, focusing on the computation, we designed mix-precision processing element array together with group systolic architecture, that can efficiently support both FP<inline-formula> <tex-math>$16\\ast $ </tex-math></inline-formula>FP16 for the MHA block (Multi-Head Attention) and FP<inline-formula> <tex-math>$16\\ast $ </tex-math></inline-formula>INT4 for the FFN layer (Feed-Forward Network). Meanwhile specific optimization on log-scale structured weight sparsity, has been used to further increase the efficiency. Secondly, to address the compilation and deployment issue, we analyzed the whole operators within LLM models and developed a universal data parallelism scheme, by which all of the input and output features maintain the same data shape, enabling to process different operators without any data rearrangement. Then we proposed an end-to-end compiler to map the whole LLM model on CPU-FPGA heterogeneous system (AMD Xilinx VCU128 FPGA). The accelerator achieves <inline-formula> <tex-math>$1.91\\times $ </tex-math></inline-formula> higher throughput and <inline-formula> <tex-math>$7.55\\times $ </tex-math></inline-formula> higher energy efficiency than the commercial GPU (NVIDIA A100-SXM4-80G). When compared with state-of-the-art FPGA accelerator of FlightLLM, it shows 10-24% better performance in terms of HBM bandwidth utilization, energy efficiency and LLM throughput.","PeriodicalId":13039,"journal":{"name":"IEEE Transactions on Circuits and Systems I: Regular Papers","volume":"72 7","pages":"3352-3365"},"PeriodicalIF":5.2000,"publicationDate":"2025-03-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Circuits and Systems I: Regular Papers","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/10916480/","RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
引用次数: 0
Abstract
The rapid advancements in artificial intelligence (AI), particularly the Large Language Models (LLMs), have profoundly affected our daily work and communication forms. However, it is still a challenge to deploy LLMs on resource-constrained edge devices (such as robots), due to the intensive computation requirements, heavy memory access, diverse operator types and difficulties in compilation. In this work, we proposed EdgeLLM to address the above issues. Firstly, focusing on the computation, we designed mix-precision processing element array together with group systolic architecture, that can efficiently support both FP$16\ast $ FP16 for the MHA block (Multi-Head Attention) and FP$16\ast $ INT4 for the FFN layer (Feed-Forward Network). Meanwhile specific optimization on log-scale structured weight sparsity, has been used to further increase the efficiency. Secondly, to address the compilation and deployment issue, we analyzed the whole operators within LLM models and developed a universal data parallelism scheme, by which all of the input and output features maintain the same data shape, enabling to process different operators without any data rearrangement. Then we proposed an end-to-end compiler to map the whole LLM model on CPU-FPGA heterogeneous system (AMD Xilinx VCU128 FPGA). The accelerator achieves $1.91\times $ higher throughput and $7.55\times $ higher energy efficiency than the commercial GPU (NVIDIA A100-SXM4-80G). When compared with state-of-the-art FPGA accelerator of FlightLLM, it shows 10-24% better performance in terms of HBM bandwidth utilization, energy efficiency and LLM throughput.
期刊介绍:
TCAS I publishes regular papers in the field specified by the theory, analysis, design, and practical implementations of circuits, and the application of circuit techniques to systems and to signal processing. Included is the whole spectrum from basic scientific theory to industrial applications. The field of interest covered includes: - Circuits: Analog, Digital and Mixed Signal Circuits and Systems - Nonlinear Circuits and Systems, Integrated Sensors, MEMS and Systems on Chip, Nanoscale Circuits and Systems, Optoelectronic - Circuits and Systems, Power Electronics and Systems - Software for Analog-and-Logic Circuits and Systems - Control aspects of Circuits and Systems.