{"title":"PowerInfer:使用消费级 GPU 快速处理大型语言模型","authors":"Yixin Song, Zeyu Mi, Haotong Xie, Haibo Chen","doi":"arxiv-2312.12456","DOIUrl":null,"url":null,"abstract":"This paper introduces PowerInfer, a high-speed Large Language Model (LLM)\ninference engine on a personal computer (PC) equipped with a single\nconsumer-grade GPU. The key underlying the design of PowerInfer is exploiting\nthe high locality inherent in LLM inference, characterized by a power-law\ndistribution in neuron activation. This distribution indicates that a small\nsubset of neurons, termed hot neurons, are consistently activated across\ninputs, while the majority, cold neurons, vary based on specific inputs.\nPowerInfer exploits such an insight to design a GPU-CPU hybrid inference\nengine: hot-activated neurons are preloaded onto the GPU for fast access, while\ncold-activated neurons are computed on the CPU, thus significantly reducing GPU\nmemory demands and CPU-GPU data transfers. PowerInfer further integrates\nadaptive predictors and neuron-aware sparse operators, optimizing the\nefficiency of neuron activation and computational sparsity. Evaluation shows\nthat PowerInfer attains an average token generation rate of 13.20 tokens/s,\nwith a peak of 29.08 tokens/s, across various LLMs (including OPT-175B) on a\nsingle NVIDIA RTX 4090 GPU, only 18% lower than that achieved by a top-tier\nserver-grade A100 GPU. This significantly outperforms llama.cpp by up to 11.69x\nwhile retaining model accuracy.","PeriodicalId":501333,"journal":{"name":"arXiv - CS - Operating Systems","volume":"58 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2023-12-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU\",\"authors\":\"Yixin Song, Zeyu Mi, Haotong Xie, Haibo Chen\",\"doi\":\"arxiv-2312.12456\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"This paper introduces PowerInfer, a high-speed Large Language Model (LLM)\\ninference engine on a personal computer (PC) equipped with a single\\nconsumer-grade GPU. The key underlying the design of PowerInfer is exploiting\\nthe high locality inherent in LLM inference, characterized by a power-law\\ndistribution in neuron activation. This distribution indicates that a small\\nsubset of neurons, termed hot neurons, are consistently activated across\\ninputs, while the majority, cold neurons, vary based on specific inputs.\\nPowerInfer exploits such an insight to design a GPU-CPU hybrid inference\\nengine: hot-activated neurons are preloaded onto the GPU for fast access, while\\ncold-activated neurons are computed on the CPU, thus significantly reducing GPU\\nmemory demands and CPU-GPU data transfers. PowerInfer further integrates\\nadaptive predictors and neuron-aware sparse operators, optimizing the\\nefficiency of neuron activation and computational sparsity. Evaluation shows\\nthat PowerInfer attains an average token generation rate of 13.20 tokens/s,\\nwith a peak of 29.08 tokens/s, across various LLMs (including OPT-175B) on a\\nsingle NVIDIA RTX 4090 GPU, only 18% lower than that achieved by a top-tier\\nserver-grade A100 GPU. 
This significantly outperforms llama.cpp by up to 11.69x\\nwhile retaining model accuracy.\",\"PeriodicalId\":501333,\"journal\":{\"name\":\"arXiv - CS - Operating Systems\",\"volume\":\"58 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-12-16\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Operating Systems\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2312.12456\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Operating Systems","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2312.12456","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU
This paper introduces PowerInfer, a high-speed Large Language Model (LLM)
inference engine on a personal computer (PC) equipped with a single
consumer-grade GPU. The key insight underlying the design of PowerInfer is exploiting
the high locality inherent in LLM inference, characterized by a power-law
distribution in neuron activation. This distribution indicates that a small
subset of neurons, termed hot neurons, are consistently activated across
inputs, while the majority, cold neurons, vary based on specific inputs.
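To make the hot/cold distinction concrete, here is a minimal profiling sketch in Python. It is not PowerInfer's implementation: the function name split_hot_cold, the per-neuron activation_counts array, and the gpu_budget parameter are illustrative assumptions. It simply ranks neurons by how often they fired during offline profiling and keeps the most frequently activated ones resident on the GPU.

```python
import numpy as np

def split_hot_cold(activation_counts: np.ndarray, gpu_budget: int):
    """Partition neurons into a hot (GPU-resident) set and a cold (CPU-resident) set.

    activation_counts[i] -- how often neuron i fired during offline profiling (assumed given)
    gpu_budget           -- how many neurons fit within the GPU's memory budget
    """
    order = np.argsort(activation_counts)[::-1]    # most frequently activated neurons first
    hot, cold = order[:gpu_budget], order[gpu_budget:]
    coverage = activation_counts[hot].sum() / activation_counts.sum()
    return hot, cold, coverage

# Toy profile: power-law (Zipf-like) activation frequencies over 10,000 neurons.
rng = np.random.default_rng(0)
counts = rng.zipf(a=2.0, size=10_000).astype(np.float64)
hot, cold, cov = split_hot_cold(counts, gpu_budget=1_000)
print(f"hot: {len(hot)}  cold: {len(cold)}  activations covered by hot set: {cov:.1%}")
```

Because the simulated counts are heavy-tailed (Zipf-distributed here as a stand-in for the power law the abstract describes), a hot set of roughly 10% of the neurons covers most of the recorded activations, which is exactly the locality property PowerInfer exploits.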
PowerInfer exploits this insight to design a GPU-CPU hybrid inference
engine: hot-activated neurons are preloaded onto the GPU for fast access, while
cold-activated neurons are computed on the CPU, thus significantly reducing GPU
memory demands and CPU-GPU data transfers. PowerInfer further integrates
adaptive predictors and neuron-aware sparse operators, which improve the
prediction of neuron activation and the efficiency of sparse computation. Evaluation shows
that PowerInfer attains an average token generation rate of 13.20 tokens/s,
with a peak of 29.08 tokens/s, across various LLMs (including OPT-175B) on a
single NVIDIA RTX 4090 GPU, only 18% lower than that achieved by a top-tier
server-grade A100 GPU. This significantly outperforms llama.cpp by up to 11.69x
while retaining model accuracy.
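The GPU-CPU split and the role of the activation predictors described above can likewise be sketched in a few lines. This is a toy model under stated assumptions, not PowerInfer's engine: plain NumPy arrays stand in for GPU-resident hot rows and CPU-resident cold rows, the function hybrid_ffn_sketch is hypothetical, and the predicted_active_cold indices play the role of an adaptive predictor's output; the real system relies on the neuron-aware sparse operators mentioned in the abstract.

```python
import numpy as np

def hybrid_ffn_sketch(x, W_hot, W_cold, predicted_active_cold):
    """Toy split of one FFN up-projection across a hot (GPU) and a cold (CPU) neuron set.

    x                     -- (d_model,) input activation for the current token
    W_hot                 -- (n_hot, d_model) rows for hot neurons (GPU-resident in PowerInfer)
    W_cold                -- (n_cold, d_model) rows for cold neurons (kept in CPU memory)
    predicted_active_cold -- indices of cold neurons a predictor expects to fire for this token
    """
    n_hot, n_cold = W_hot.shape[0], W_cold.shape[0]
    out = np.zeros(n_hot + n_cold)

    # "GPU" side: hot neurons are always computed; their weights are preloaded on the device.
    out[:n_hot] = np.maximum(W_hot @ x, 0.0)                 # ReLU zeroes neurons that did not fire

    # "CPU" side: gather and compute only the cold rows the predictor flagged as likely active.
    rows = W_cold[predicted_active_cold]
    out[n_hot + predicted_active_cold] = np.maximum(rows @ x, 0.0)
    return out

# Toy usage with hypothetical sizes.
rng = np.random.default_rng(1)
d_model, n_hot, n_cold = 64, 32, 224
x = rng.standard_normal(d_model)
W_hot = rng.standard_normal((n_hot, d_model))
W_cold = rng.standard_normal((n_cold, d_model))
predicted = rng.choice(n_cold, size=16, replace=False)       # stand-in for the adaptive predictor
y = hybrid_ffn_sketch(x, W_hot, W_cold, predicted)
print("cold neurons actually computed:", len(predicted), "of", n_cold)
```

The abstract's claim follows from this structure: hot rows never leave the GPU, and only a small, predicted subset of cold rows is touched on the CPU, which is where the reduction in GPU memory demand and CPU-GPU transfer volume comes from.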