Jung-Woo Kim, Seung-Hwan Yoon, Dong-Kyeong Kang, Seong-Won Lim, Hak-Bum Lee, Su-Min Oh, Young-Ho Seo
{"title":"一种新的基于大型语言模型的设备上人工智能定点仿真方法","authors":"Jung-Woo Kim, Seung-Hwan Yoon, Dong-Kyeong Kang, Seong-Won Lim, Hak-Bum Lee, Su-Min Oh, Young-Ho Seo","doi":"10.1016/j.sysarc.2025.103548","DOIUrl":null,"url":null,"abstract":"<div><div>Large language models (LLMs) have demonstrated outstanding performance across various natural language processing tasks, and their utilization in on-device environments is gradually increasing. This paper proposes a full integer arithmetic (fixed-point arithmetic) methodology utilizing fixed-point simulation to optimize the LLaMA3-8B-Instruct model for on-device hardware development. The proposed approach enables integer computations without performance degradation on the MMLU benchmark. Conventional quantization methods primarily focus on integer conversion of weight matrix multiplication operations; however, they require subsequent floating-point restoration, which can lead to computational bottlenecks. In contrast, this paper eliminates such floating-point dependencies by converting all operations, including SoftMax, layer normalization and activation functions, into fixed-point integer formats. Furthermore, to maintain the accuracy in the integer computation, we partition the model’s computational graph into repeatable and one-to-one nodes (RONs) and hierarchically determine integer and fractional bit-widths, ensuring that the pre-trained parameters and the bit-width of constants and initial values used in inference are optimized. Experimental results show that the proposed approach maintains the same accuracy as the FP16/FP32 baseline while achieving up to a 84.67% reduction in hardware resource usage and approximately 16× inference speed-up, as analyzed using the Synopsys Design Compiler. This demonstrates that fully integer computation of LLMs can simultaneously achieve high performance and efficiency.</div></div>","PeriodicalId":50027,"journal":{"name":"Journal of Systems Architecture","volume":"168 ","pages":"Article 103548"},"PeriodicalIF":4.1000,"publicationDate":"2025-08-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"A new fixed-point simulation methodology for on-device AI based on large language models\",\"authors\":\"Jung-Woo Kim, Seung-Hwan Yoon, Dong-Kyeong Kang, Seong-Won Lim, Hak-Bum Lee, Su-Min Oh, Young-Ho Seo\",\"doi\":\"10.1016/j.sysarc.2025.103548\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Large language models (LLMs) have demonstrated outstanding performance across various natural language processing tasks, and their utilization in on-device environments is gradually increasing. This paper proposes a full integer arithmetic (fixed-point arithmetic) methodology utilizing fixed-point simulation to optimize the LLaMA3-8B-Instruct model for on-device hardware development. The proposed approach enables integer computations without performance degradation on the MMLU benchmark. Conventional quantization methods primarily focus on integer conversion of weight matrix multiplication operations; however, they require subsequent floating-point restoration, which can lead to computational bottlenecks. In contrast, this paper eliminates such floating-point dependencies by converting all operations, including SoftMax, layer normalization and activation functions, into fixed-point integer formats. 
Furthermore, to maintain the accuracy in the integer computation, we partition the model’s computational graph into repeatable and one-to-one nodes (RONs) and hierarchically determine integer and fractional bit-widths, ensuring that the pre-trained parameters and the bit-width of constants and initial values used in inference are optimized. Experimental results show that the proposed approach maintains the same accuracy as the FP16/FP32 baseline while achieving up to a 84.67% reduction in hardware resource usage and approximately 16× inference speed-up, as analyzed using the Synopsys Design Compiler. This demonstrates that fully integer computation of LLMs can simultaneously achieve high performance and efficiency.</div></div>\",\"PeriodicalId\":50027,\"journal\":{\"name\":\"Journal of Systems Architecture\",\"volume\":\"168 \",\"pages\":\"Article 103548\"},\"PeriodicalIF\":4.1000,\"publicationDate\":\"2025-08-21\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Journal of Systems Architecture\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S1383762125002206\",\"RegionNum\":2,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Systems Architecture","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1383762125002206","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE","Score":null,"Total":0}
A new fixed-point simulation methodology for on-device AI based on large language models
Large language models (LLMs) have demonstrated outstanding performance across various natural language processing tasks, and their use in on-device environments is steadily increasing. This paper proposes a full-integer (fixed-point) arithmetic methodology that uses fixed-point simulation to optimize the LLaMA3-8B-Instruct model for on-device hardware development. The proposed approach enables integer computation without accuracy degradation on the MMLU benchmark. Conventional quantization methods primarily focus on integer conversion of the weight matrix multiplication operations; however, they require the results to be restored to floating point afterward, which can create computational bottlenecks. In contrast, this paper eliminates such floating-point dependencies by converting all operations, including SoftMax, layer normalization, and activation functions, into fixed-point integer formats. Furthermore, to maintain accuracy under integer computation, we partition the model’s computational graph into repeatable and one-to-one nodes (RONs) and hierarchically determine integer and fractional bit-widths, so that the bit-widths of the pre-trained parameters and of the constants and initial values used during inference are optimized. Experimental results show that the proposed approach maintains the same accuracy as the FP16/FP32 baseline while achieving up to an 84.67% reduction in hardware resource usage and approximately 16× inference speed-up, as analyzed using the Synopsys Design Compiler. This demonstrates that fully integer computation of LLMs can achieve high performance and efficiency simultaneously.
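To make the idea of fixed-point simulation concrete, the following is a minimal Python sketch, not the authors' implementation: it shows how a value can be quantized onto a signed fixed-point grid with separate integer and fractional bit-widths, and how a non-linear operation such as SoftMax can then be simulated on that grid. The function names and the bit-width choices (e.g., 4 integer / 11 fractional bits) are illustrative assumptions only.

```python
import numpy as np

def to_fixed(x, int_bits, frac_bits):
    """Quantize a float array onto a signed fixed-point grid with
    `int_bits` integer bits and `frac_bits` fractional bits
    (sign bit excluded). Out-of-range values are saturated."""
    scale = 2 ** frac_bits
    q_max = 2 ** (int_bits + frac_bits) - 1   # largest positive code
    q_min = -(2 ** (int_bits + frac_bits))    # most negative code
    return np.clip(np.round(x * scale), q_min, q_max).astype(np.int64)

def to_float(q, frac_bits):
    """Map integer codes back to the floats they represent
    (used only to simulate fixed-point behavior in software)."""
    return q.astype(np.float64) / (2 ** frac_bits)

def fixed_point_softmax(x, int_bits=4, frac_bits=11):
    """Simulate a SoftMax whose inputs and outputs live on fixed-point grids."""
    xq = to_float(to_fixed(x, int_bits, frac_bits), frac_bits)
    e = np.exp(xq - xq.max())                 # numerically stabilized exponent
    y = e / e.sum()
    # SoftMax outputs lie in [0, 1], so no integer bits are needed on the output
    out_frac = int_bits + frac_bits
    return to_float(to_fixed(y, 0, out_frac), out_frac)

if __name__ == "__main__":
    logits = np.array([2.3, -1.7, 0.4, 3.1])
    print(fixed_point_softmax(logits))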
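In the paper's actual methodology, the integer and fractional bit-widths would be determined hierarchically per node of the computational graph (the RON partitioning) rather than fixed globally as in this toy example.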
Journal overview:
The Journal of Systems Architecture: Embedded Software Design (JSA) is a journal covering all design and architectural aspects related to embedded systems and software. It ranges from the microarchitecture level via the system software level up to the application-specific architecture level. Aspects such as real-time systems, operating systems, FPGA programming, programming languages, communications (limited to analysis and the software stack), mobile systems, parallel and distributed architectures as well as additional subjects in the computer and system architecture area will fall within the scope of this journal. Technology will not be a main focus, but its use and relevance to particular designs will be. Case studies are welcome but must contribute more than just a design for a particular piece of software.
Design automation of such systems, including methodologies, techniques, and tools for their design, as well as novel designs of software components, falls within the scope of this journal. Novel applications that use embedded systems are also central to this journal. While hardware itself is not a part of this journal, hardware/software co-design methods that consider the interplay between software and hardware components, with an emphasis on software, are also relevant here.