{"title":"Improving Throughput-oriented Generative Inference with CPUs","authors":"Daon Park, Sungbin Jo, Bernhard Egger","doi":"10.1145/3609510.3609815","DOIUrl":null,"url":null,"abstract":"Despite recent attempts to reduce the number of parameters of large language models (LLMs), their parameter data is still too large to fit into a single GPU. With the emergence of throughput-oriented tasks, high-throughput generative inference frameworks for LLMs on a single commodity GPU leverage GPU, DRAM, and NVMe to run inference on large models with terabytes of data. Our analysis of the technique shows that the runtime is dominated by data transfers of the weights, leading to a low utilization of both the GPU and the CPU. In this paper, we increase the throughput and decrease the total latency of state-of-the-art frameworks by including the CPU as a compute device and overlapping computations on the CPU with GPU data transfers. Our work shows a promising improvement of around 40% in throughput and total latency, with potential room for further improvements.","PeriodicalId":149629,"journal":{"name":"Proceedings of the 14th ACM SIGOPS Asia-Pacific Workshop on Systems","volume":"3 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 14th ACM SIGOPS Asia-Pacific Workshop on Systems","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3609510.3609815","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0
Abstract
Despite recent attempts to reduce the number of parameters of large language models (LLMs), their parameter data is still too large to fit into the memory of a single GPU. With the emergence of throughput-oriented tasks, high-throughput generative inference frameworks for LLMs on a single commodity GPU leverage GPU memory, CPU DRAM, and NVMe storage to run inference on large models with terabytes of data. Our analysis of this technique shows that the runtime is dominated by data transfers of the weights, leading to low utilization of both the GPU and the CPU. In this paper, we increase the throughput and decrease the total latency of state-of-the-art frameworks by including the CPU as a compute device and overlapping computations on the CPU with GPU data transfers. Our work shows a promising improvement of around 40% in throughput and total latency, with potential room for further improvements.
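
The core idea, overlapping CPU computation with the transfer of weights to the GPU, can be illustrated with a minimal PyTorch sketch. This is not the authors' implementation: the pipelining scheme, the split of the hidden states into a GPU part and a CPU part, and names such as run_layer_pipelined are assumptions made purely for illustration.

```python
# A minimal sketch (not the paper's code) of overlapping CPU compute with GPU
# weight transfers in an offloading loop: layer weights live in pinned host
# memory and are prefetched to the GPU on a side stream while the CPU works on
# its share of the batch. All names here are hypothetical placeholders.

import torch


def run_layer_pipelined(layers_cpu, hidden_gpu, hidden_cpu):
    """Run a stack of toy layers, prefetching layer i+1's weights while
    layer i executes, with part of the batch computed on the CPU."""
    copy_stream = torch.cuda.Stream()            # dedicated H2D copy stream
    compute_stream = torch.cuda.current_stream()

    # Pre-issue the transfer of the first layer's weights.
    weights_gpu = {k: v.to("cuda", non_blocking=True)
                   for k, v in layers_cpu[0].items()}

    for i, layer_cpu in enumerate(layers_cpu):
        # 1) Prefetch the next layer's weights on the copy stream.
        if i + 1 < len(layers_cpu):
            with torch.cuda.stream(copy_stream):
                next_weights_gpu = {k: v.to("cuda", non_blocking=True)
                                    for k, v in layers_cpu[i + 1].items()}

        # 2) GPU part of the batch: dense matmul with resident weights.
        hidden_gpu = hidden_gpu @ weights_gpu["w"]

        # 3) CPU part of the batch runs concurrently with the asynchronous
        #    copy above, since its tensors never leave host memory.
        hidden_cpu = hidden_cpu @ layer_cpu["w"]

        # 4) Ensure the prefetch has finished before the next iteration.
        if i + 1 < len(layers_cpu):
            compute_stream.wait_stream(copy_stream)
            weights_gpu = next_weights_gpu

    return hidden_gpu, hidden_cpu


if __name__ == "__main__" and torch.cuda.is_available():
    # Toy model: 4 "layers", each a single 4096x4096 weight in pinned memory.
    layers = [{"w": torch.randn(4096, 4096).pin_memory()} for _ in range(4)]
    h_gpu = torch.randn(8, 4096, device="cuda")
    h_cpu = torch.randn(8, 4096)
    run_layer_pipelined(layers, h_gpu, h_cpu)
```

Because the host tensors are pinned, the host-to-device copies on the dedicated stream proceed asynchronously, so the CPU matrix multiplication in step 3 executes while the next layer's weights are in flight; hiding CPU work behind weight transfers in this way is the kind of overlap the paper exploits.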