Jung-Woo Kim, Seung-Hwan Yoon, Dong-Kyeong Kang, Seong-Won Lim, Hak-Bum Lee, Su-Min Oh, Young-Ho Seo
{"title":"一种新的基于大型语言模型的设备上人工智能定点仿真方法","authors":"Jung-Woo Kim, Seung-Hwan Yoon, Dong-Kyeong Kang, Seong-Won Lim, Hak-Bum Lee, Su-Min Oh, Young-Ho Seo","doi":"10.1016/j.sysarc.2025.103548","DOIUrl":null,"url":null,"abstract":"<div><div>Large language models (LLMs) have demonstrated outstanding performance across various natural language processing tasks, and their utilization in on-device environments is gradually increasing. This paper proposes a full integer arithmetic (fixed-point arithmetic) methodology utilizing fixed-point simulation to optimize the LLaMA3-8B-Instruct model for on-device hardware development. The proposed approach enables integer computations without performance degradation on the MMLU benchmark. Conventional quantization methods primarily focus on integer conversion of weight matrix multiplication operations; however, they require subsequent floating-point restoration, which can lead to computational bottlenecks. In contrast, this paper eliminates such floating-point dependencies by converting all operations, including SoftMax, layer normalization and activation functions, into fixed-point integer formats. Furthermore, to maintain the accuracy in the integer computation, we partition the model’s computational graph into repeatable and one-to-one nodes (RONs) and hierarchically determine integer and fractional bit-widths, ensuring that the pre-trained parameters and the bit-width of constants and initial values used in inference are optimized. Experimental results show that the proposed approach maintains the same accuracy as the FP16/FP32 baseline while achieving up to a 84.67% reduction in hardware resource usage and approximately 16× inference speed-up, as analyzed using the Synopsys Design Compiler. This demonstrates that fully integer computation of LLMs can simultaneously achieve high performance and efficiency.</div></div>","PeriodicalId":50027,"journal":{"name":"Journal of Systems Architecture","volume":"168 ","pages":"Article 103548"},"PeriodicalIF":4.1000,"publicationDate":"2025-08-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"A new fixed-point simulation methodology for on-device AI based on large language models\",\"authors\":\"Jung-Woo Kim, Seung-Hwan Yoon, Dong-Kyeong Kang, Seong-Won Lim, Hak-Bum Lee, Su-Min Oh, Young-Ho Seo\",\"doi\":\"10.1016/j.sysarc.2025.103548\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Large language models (LLMs) have demonstrated outstanding performance across various natural language processing tasks, and their utilization in on-device environments is gradually increasing. This paper proposes a full integer arithmetic (fixed-point arithmetic) methodology utilizing fixed-point simulation to optimize the LLaMA3-8B-Instruct model for on-device hardware development. The proposed approach enables integer computations without performance degradation on the MMLU benchmark. Conventional quantization methods primarily focus on integer conversion of weight matrix multiplication operations; however, they require subsequent floating-point restoration, which can lead to computational bottlenecks. In contrast, this paper eliminates such floating-point dependencies by converting all operations, including SoftMax, layer normalization and activation functions, into fixed-point integer formats. 
Furthermore, to maintain the accuracy in the integer computation, we partition the model’s computational graph into repeatable and one-to-one nodes (RONs) and hierarchically determine integer and fractional bit-widths, ensuring that the pre-trained parameters and the bit-width of constants and initial values used in inference are optimized. Experimental results show that the proposed approach maintains the same accuracy as the FP16/FP32 baseline while achieving up to a 84.67% reduction in hardware resource usage and approximately 16× inference speed-up, as analyzed using the Synopsys Design Compiler. This demonstrates that fully integer computation of LLMs can simultaneously achieve high performance and efficiency.</div></div>\",\"PeriodicalId\":50027,\"journal\":{\"name\":\"Journal of Systems Architecture\",\"volume\":\"168 \",\"pages\":\"Article 103548\"},\"PeriodicalIF\":4.1000,\"publicationDate\":\"2025-08-21\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Journal of Systems Architecture\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S1383762125002206\",\"RegionNum\":2,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Systems Architecture","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1383762125002206","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE","Score":null,"Total":0}
A new fixed-point simulation methodology for on-device AI based on large language models
Large language models (LLMs) have demonstrated outstanding performance across various natural language processing tasks, and their use in on-device environments is steadily increasing. This paper proposes a full-integer (fixed-point) arithmetic methodology that uses fixed-point simulation to optimize the LLaMA3-8B-Instruct model for on-device hardware development. The proposed approach enables integer computation without accuracy degradation on the MMLU benchmark. Conventional quantization methods primarily focus on integer conversion of the weight matrix multiplication operations; however, they require the results to be restored to floating point afterward, which can create computational bottlenecks. In contrast, this paper eliminates such floating-point dependencies by converting all operations, including SoftMax, layer normalization, and activation functions, into fixed-point integer formats. Furthermore, to maintain accuracy under integer computation, we partition the model’s computational graph into repeatable and one-to-one nodes (RONs) and hierarchically determine integer and fractional bit-widths, so that the bit-widths of the pre-trained parameters and of the constants and initial values used during inference are optimized. Experimental results show that the proposed approach maintains the same accuracy as the FP16/FP32 baseline while achieving up to an 84.67% reduction in hardware resource usage and approximately 16× inference speed-up, as analyzed using the Synopsys Design Compiler. This demonstrates that fully integer computation of LLMs can achieve high performance and efficiency simultaneously.
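To make the idea of fixed-point simulation concrete, the following is a minimal Python sketch, not the authors' implementation: it shows how a value can be quantized onto a signed fixed-point grid with separate integer and fractional bit-widths, and how a non-linear operation such as SoftMax can then be simulated on that grid. The function names and the bit-width choices (e.g., 4 integer / 11 fractional bits) are illustrative assumptions only.

```python
import numpy as np

def to_fixed(x, int_bits, frac_bits):
    """Quantize a float array onto a signed fixed-point grid with
    `int_bits` integer bits and `frac_bits` fractional bits
    (sign bit excluded). Out-of-range values are saturated."""
    scale = 2 ** frac_bits
    q_max = 2 ** (int_bits + frac_bits) - 1   # largest positive code
    q_min = -(2 ** (int_bits + frac_bits))    # most negative code
    return np.clip(np.round(x * scale), q_min, q_max).astype(np.int64)

def to_float(q, frac_bits):
    """Map integer codes back to the floats they represent
    (used only to simulate fixed-point behavior in software)."""
    return q.astype(np.float64) / (2 ** frac_bits)

def fixed_point_softmax(x, int_bits=4, frac_bits=11):
    """Simulate a SoftMax whose inputs and outputs live on fixed-point grids."""
    xq = to_float(to_fixed(x, int_bits, frac_bits), frac_bits)
    e = np.exp(xq - xq.max())                 # numerically stabilized exponent
    y = e / e.sum()
    # SoftMax outputs lie in [0, 1], so no integer bits are needed on the output
    out_frac = int_bits + frac_bits
    return to_float(to_fixed(y, 0, out_frac), out_frac)

if __name__ == "__main__":
    logits = np.array([2.3, -1.7, 0.4, 3.1])
    print(fixed_point_softmax(logits))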
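In the paper's actual methodology, the integer and fractional bit-widths would be determined hierarchically per node of the computational graph (the RON partitioning) rather than fixed globally as in this toy example.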
Journal overview:
The Journal of Systems Architecture: Embedded Software Design (JSA) is a journal covering all design and architectural aspects related to embedded systems and software. It ranges from the microarchitecture level via the system software level up to the application-specific architecture level. Aspects such as real-time systems, operating systems, FPGA programming, programming languages, communications (limited to analysis and the software stack), mobile systems, parallel and distributed architectures as well as additional subjects in the computer and system architecture area will fall within the scope of this journal. Technology will not be a main focus, but its use and relevance to particular designs will be. Case studies are welcome but must contribute more than just a design for a particular piece of software.
Design automation of such systems, including methodologies, techniques, and tools for their design, as well as novel designs of software components, falls within the scope of this journal. Novel applications that use embedded systems are also central to this journal. While hardware itself is not a part of this journal, hardware/software co-design methods that consider the interplay between software and hardware components, with an emphasis on software, are also relevant here.