Abhi Jaiswal;K. C. Sharin Shahana;Sujitha Ravichandran;K. Adarsh;H. Bharath Bhat;Biresh Kumar Joardar;Sumit K. Mandal
{"title":"HALO: Communication-Aware Heterogeneous 2.5-D System for Energy-Efficient LLM Execution at Edge","authors":"Abhi Jaiswal;K. C. Sharin Shahana;Sujitha Ravichandran;K. Adarsh;H. Bharath Bhat;Biresh Kumar Joardar;Sumit K. Mandal","doi":"10.1109/JETCAS.2024.3427421","DOIUrl":null,"url":null,"abstract":"Large Language Models (LLMs) are used to perform various tasks, especially in the domain of natural language processing (NLP). State-of-the-art LLMs consist of a large number of parameters that necessitate a high volume of computations. Currently, GPUs are the preferred choice of hardware platform to execute LLM inference. However, monolithic GPU-based systems executing large LLMs pose significant drawbacks in terms of fabrication cost and energy efficiency. In this work, we propose a heterogeneous 2.5D chiplet-based architecture for accelerating LLM inference. The proposed 2.5D system consists of heterogeneous chiplets connected via a network-on-package (NoP). In the proposed 2.5D system, we leverage the energy efficiency of in-memory computing (IMC) and the general-purpose computing capability of CMOS-based floating point units (FPUs). The 2.5D technology helps to integrate two different technologies (IMC and CMOS) on the same system. Due to a large number of parameters, communication between chiplets becomes a significant performance bottleneck if not optimized while executing LLMs. To this end, we propose a communication-aware scalable technique to map different pieces of computations of an LLM onto different chiplets. The proposed mapping technique minimizes the communication energy and latency over the NoP, and is significantly faster than existing optimization techniques. Thorough experimental evaluations with a wide variety of LLMs show that the proposed 2.5D system provides up to \n<inline-formula> <tex-math>$972\\times $ </tex-math></inline-formula>\n improvement in latency and \n<inline-formula> <tex-math>$1600\\times $ </tex-math></inline-formula>\n improvement in energy consumption with respect to state-of-the-art edge devices equipped with GPU.","PeriodicalId":48827,"journal":{"name":"IEEE Journal on Emerging and Selected Topics in Circuits and Systems","volume":"14 3","pages":"425-439"},"PeriodicalIF":3.7000,"publicationDate":"2024-07-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Journal on Emerging and Selected Topics in Circuits and Systems","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/10596278/","RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
引用次数: 0
Abstract
Large Language Models (LLMs) are used to perform various tasks, especially in the domain of natural language processing (NLP). State-of-the-art LLMs consist of a large number of parameters that necessitate a high volume of computations. Currently, GPUs are the preferred choice of hardware platform to execute LLM inference. However, monolithic GPU-based systems executing large LLMs pose significant drawbacks in terms of fabrication cost and energy efficiency. In this work, we propose a heterogeneous 2.5D chiplet-based architecture for accelerating LLM inference. The proposed 2.5D system consists of heterogeneous chiplets connected via a network-on-package (NoP). In the proposed 2.5D system, we leverage the energy efficiency of in-memory computing (IMC) and the general-purpose computing capability of CMOS-based floating point units (FPUs). The 2.5D technology helps to integrate two different technologies (IMC and CMOS) on the same system. Due to a large number of parameters, communication between chiplets becomes a significant performance bottleneck if not optimized while executing LLMs. To this end, we propose a communication-aware scalable technique to map different pieces of computations of an LLM onto different chiplets. The proposed mapping technique minimizes the communication energy and latency over the NoP, and is significantly faster than existing optimization techniques. Thorough experimental evaluations with a wide variety of LLMs show that the proposed 2.5D system provides up to
$972\times $
improvement in latency and
$1600\times $
improvement in energy consumption with respect to state-of-the-art edge devices equipped with GPU.
期刊介绍:
The IEEE Journal on Emerging and Selected Topics in Circuits and Systems is published quarterly and solicits, with particular emphasis on emerging areas, special issues on topics that cover the entire scope of the IEEE Circuits and Systems (CAS) Society, namely the theory, analysis, design, tools, and implementation of circuits and systems, spanning their theoretical foundations, applications, and architectures for signal and information processing.