SmartQuant：基于 CXL 的人工智能模型存储，支持运行时可配置的权重量化

IF 1.4 3区计算机科学 Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

IEEE Computer Architecture Letters Pub Date : 2024-09-02 DOI:10.1109/LCA.2024.3452699

Rui Xie;Asad Ul Haq;Linsen Ma;Krystal Sun;Sanchari Sen;Swagath Venkataramani;Liu Liu;Tong Zhang

{"title":"SmartQuant：基于 CXL 的人工智能模型存储，支持运行时可配置的权重量化","authors":"Rui Xie;Asad Ul Haq;Linsen Ma;Krystal Sun;Sanchari Sen;Swagath Venkataramani;Liu Liu;Tong Zhang","doi":"10.1109/LCA.2024.3452699","DOIUrl":null,"url":null,"abstract":"Recent studies have revealed that, during the inference on generative AI models such as transformer, the importance of different weights exhibits substantial context-dependent variations. This naturally manifests a promising potential of adaptively configuring weight quantization to improve the generative AI inference efficiency. Although configurable weight quantization can readily leverage the hardware support of variable-precision arithmetics in modern GPU and AI accelerators, little prior research has studied how one could exploit variable weight quantization to proportionally improve the AI model memory access speed and energy efficiency. Motivated by the rapidly maturing CXL ecosystem, this work develops a CXL-based design solution to fill this gap. The key is to allow CXL memory controllers play an active role in supporting and exploiting runtime configurable weight quantization. Using transformer as a representative generative AI model, we carried out experiments that well demonstrate the effectiveness of the proposed design solution.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"23 2","pages":"199-202"},"PeriodicalIF":1.4000,"publicationDate":"2024-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"SmartQuant: CXL-Based AI Model Store in Support of Runtime Configurable Weight Quantization\",\"authors\":\"Rui Xie;Asad Ul Haq;Linsen Ma;Krystal Sun;Sanchari Sen;Swagath Venkataramani;Liu Liu;Tong Zhang\",\"doi\":\"10.1109/LCA.2024.3452699\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Recent studies have revealed that, during the inference on generative AI models such as transformer, the importance of different weights exhibits substantial context-dependent variations. This naturally manifests a promising potential of adaptively configuring weight quantization to improve the generative AI inference efficiency. Although configurable weight quantization can readily leverage the hardware support of variable-precision arithmetics in modern GPU and AI accelerators, little prior research has studied how one could exploit variable weight quantization to proportionally improve the AI model memory access speed and energy efficiency. Motivated by the rapidly maturing CXL ecosystem, this work develops a CXL-based design solution to fill this gap. The key is to allow CXL memory controllers play an active role in supporting and exploiting runtime configurable weight quantization. Using transformer as a representative generative AI model, we carried out experiments that well demonstrate the effectiveness of the proposed design solution.\",\"PeriodicalId\":51248,\"journal\":{\"name\":\"IEEE Computer Architecture Letters\",\"volume\":\"23 2\",\"pages\":\"199-202\"},\"PeriodicalIF\":1.4000,\"publicationDate\":\"2024-09-02\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Computer Architecture Letters\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10663202/\",\"RegionNum\":3,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q4\",\"JCRName\":\"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Computer Architecture Letters","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10663202/","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q4","JCRName":"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE","Score":null,"Total":0}

引用次数: 0

摘要

最近的研究发现，在生成式人工智能模型（如变压器）的推理过程中，不同权重的重要性会表现出很大的上下文依赖性变化。这自然体现了自适应配置权重量化以提高生成式人工智能推理效率的巨大潜力。虽然可配置的权重量化可以轻松利用现代 GPU 和人工智能加速器对可变精度算术的硬件支持，但之前的研究很少涉及如何利用可变权重量化成比例地提高人工智能模型的内存访问速度和能效。在快速成熟的 CXL 生态系统的推动下，这项工作开发了一种基于 CXL 的设计解决方案，以填补这一空白。关键是让 CXL 内存控制器在支持和利用运行时可配置权重量化方面发挥积极作用。我们使用变压器作为具有代表性的生成式人工智能模型进行了实验，很好地证明了所提设计方案的有效性。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

SmartQuant: CXL-Based AI Model Store in Support of Runtime Configurable Weight Quantization

Recent studies have revealed that, during the inference on generative AI models such as transformer, the importance of different weights exhibits substantial context-dependent variations. This naturally manifests a promising potential of adaptively configuring weight quantization to improve the generative AI inference efficiency. Although configurable weight quantization can readily leverage the hardware support of variable-precision arithmetics in modern GPU and AI accelerators, little prior research has studied how one could exploit variable weight quantization to proportionally improve the AI model memory access speed and energy efficiency. Motivated by the rapidly maturing CXL ecosystem, this work develops a CXL-based design solution to fill this gap. The key is to allow CXL memory controllers play an active role in supporting and exploiting runtime configurable weight quantization. Using transformer as a representative generative AI model, we carried out experiments that well demonstrate the effectiveness of the proposed design solution.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

IEEE Computer Architecture Letters COMPUTER SCIENCE, HARDWARE & ARCHITECTURE-

CiteScore

4.60

自引率

4.30%

发文量

期刊介绍： IEEE Computer Architecture Letters is a rigorously peer-reviewed forum for publishing early, high-impact results in the areas of uni- and multiprocessor computer systems, computer architecture, microarchitecture, workload characterization, performance evaluation and simulation techniques, and power-aware computing. Submissions are welcomed on any topic in computer architecture, especially but not limited to: microprocessor and multiprocessor systems, microarchitecture and ILP processors, workload characterization, performance evaluation and simulation techniques, compiler-hardware and operating system-hardware interactions, interconnect architectures, memory and cache systems, power and thermal issues at the architecture level, I/O architectures and techniques, independent validation of previously published results, analysis of unsuccessful techniques, domain-specific processor architectures (e.g., embedded, graphics, network, etc.), real-time and high-availability architectures, reconfigurable systems.