HALO: Communication-Aware Heterogeneous 2.5-D System for Energy-Efficient LLM Execution at Edge

Abhi Jaiswal; K. C. Sharin Shahana; Sujitha Ravichandran; K. Adarsh; H. Bharath Bhat; Biresh Kumar Joardar; Sumit K. Mandal

DOI: 10.1109/JETCAS.2024.3427421
IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 14, no. 3, pp. 425-439, published 2024-07-12.
URL: https://ieeexplore.ieee.org/document/10596278/
Large Language Models (LLMs) are used to perform a variety of tasks, especially in the domain of natural language processing (NLP). State-of-the-art LLMs contain a large number of parameters, which necessitates a high volume of computation. Currently, GPUs are the preferred hardware platform for LLM inference. However, monolithic GPU-based systems executing large LLMs have significant drawbacks in terms of fabrication cost and energy efficiency. In this work, we propose a heterogeneous 2.5-D chiplet-based architecture for accelerating LLM inference. The proposed 2.5-D system consists of heterogeneous chiplets connected via a network-on-package (NoP), and leverages the energy efficiency of in-memory computing (IMC) together with the general-purpose computing capability of CMOS-based floating-point units (FPUs); 2.5-D integration allows these two technologies (IMC and CMOS) to coexist in the same system. Because of the large number of parameters, inter-chiplet communication becomes a significant performance bottleneck during LLM execution if left unoptimized. To this end, we propose a communication-aware, scalable technique that maps different pieces of an LLM's computation onto different chiplets. The proposed mapping technique minimizes communication energy and latency over the NoP, and is significantly faster than existing optimization techniques. Thorough experimental evaluation with a wide variety of LLMs shows that the proposed 2.5-D system provides up to 972× improvement in latency and 1600× improvement in energy consumption relative to state-of-the-art GPU-equipped edge devices.
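The abstract does not spell out the mapping algorithm itself, but the objective it optimizes can be illustrated with a toy model: place sequential pieces of an LLM's computation onto chiplets of a package mesh so that the total NoP traffic cost (data volume times hop distance) is minimized. The mesh layout, data volumes, and the brute-force search below are all illustrative assumptions, not the paper's (scalable) technique.

```python
from itertools import permutations

# Toy setup (assumed, not from the paper): four chiplets on a 2x2 mesh,
# and an LLM split into four sequential computation pieces.
coords = {0: (0, 0), 1: (0, 1), 2: (1, 0), 3: (1, 1)}  # chiplet grid positions
volumes = [10, 3, 8]  # data volume passed between piece i and piece i+1

def nop_cost(mapping):
    """NoP communication cost of assigning piece i -> chiplet mapping[i]:
    sum over adjacent pieces of (data volume) * (Manhattan hop distance)."""
    cost = 0
    for i, v in enumerate(volumes):
        (x1, y1), (x2, y2) = coords[mapping[i]], coords[mapping[i + 1]]
        cost += v * (abs(x1 - x2) + abs(y1 - y2))
    return cost

# Brute force over all assignments -- only viable at toy scale; the paper's
# contribution is precisely a scalable alternative to exhaustive search.
best = min(permutations(range(4)), key=nop_cost)
print(best, nop_cost(best))
```

Here the naive in-order mapping (0, 1, 2, 3) routes the heaviest transfer across two hops, while the best mapping keeps every communicating pair on adjacent chiplets, which is the intuition behind communication-aware placement.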
Journal introduction:
The IEEE Journal on Emerging and Selected Topics in Circuits and Systems is published quarterly. It solicits, with particular emphasis on emerging areas, special issues on topics covering the entire scope of the IEEE Circuits and Systems (CAS) Society: the theory, analysis, design, tools, and implementation of circuits and systems, spanning their theoretical foundations, applications, and architectures for signal and information processing.