{"title":"ISO:用于 LLM 推断的序列内计算与通信的重叠","authors":"Bin Xiao, Lei Su","doi":"arxiv-2409.11155","DOIUrl":null,"url":null,"abstract":"In the realm of Large Language Model (LLM) inference, the inherent structure\nof transformer models coupled with the multi-GPU tensor parallelism strategy\nleads to a sequential execution of computation and communication. This results\nin substantial underutilization of computing resources during the communication\nphase. To mitigate this inefficiency, various techniques have been developed to\noptimize the use of computational power throughout the communication process.\nThese strategies primarily involve overlapping matrix computations and\ncommunications, as well as interleaving micro-batches across different\nrequests. Nonetheless, these approaches either fall short of achieving ideal\noverlap or impose certain limitations on their application. To overcome these\nchallenges, this paper introduces a novel strategy for\ncomputation-communication overlap that operates at the sequence level. This\nmethod not only enhances the degree of overlap but also minimizes the\nconstraints on its applicability. Experimental evaluations conducted using\n30b/70b models have demonstrated significant improvements in efficiency.\nSpecifically, the proposed technique has been shown to reduce time consumption\nby approximately 35% on 4090 GPU and by roughly 15% on A800 GPU during the\nprefill stage of LLM inference.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"1 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"ISO: Overlap of Computation and Communication within Seqenence For LLM Inference\",\"authors\":\"Bin Xiao, Lei Su\",\"doi\":\"arxiv-2409.11155\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In the realm of Large Language Model (LLM) inference, the inherent structure\\nof transformer models coupled with the multi-GPU tensor parallelism strategy\\nleads to a sequential execution of computation and communication. This results\\nin substantial underutilization of computing resources during the communication\\nphase. To mitigate this inefficiency, various techniques have been developed to\\noptimize the use of computational power throughout the communication process.\\nThese strategies primarily involve overlapping matrix computations and\\ncommunications, as well as interleaving micro-batches across different\\nrequests. Nonetheless, these approaches either fall short of achieving ideal\\noverlap or impose certain limitations on their application. To overcome these\\nchallenges, this paper introduces a novel strategy for\\ncomputation-communication overlap that operates at the sequence level. This\\nmethod not only enhances the degree of overlap but also minimizes the\\nconstraints on its applicability. 
Experimental evaluations conducted using\\n30b/70b models have demonstrated significant improvements in efficiency.\\nSpecifically, the proposed technique has been shown to reduce time consumption\\nby approximately 35% on 4090 GPU and by roughly 15% on A800 GPU during the\\nprefill stage of LLM inference.\",\"PeriodicalId\":501291,\"journal\":{\"name\":\"arXiv - CS - Performance\",\"volume\":\"1 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-09-04\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Performance\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2409.11155\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Performance","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.11155","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
ISO: Overlap of Computation and Communication within Sequence For LLM Inference
In the realm of Large Language Model (LLM) inference, the inherent structure
of transformer models coupled with the multi-GPU tensor parallelism strategy
leads to sequential execution of computation and communication. This results
in substantial underutilization of computing resources during the communication
phase. To mitigate this inefficiency, various techniques have been developed to
optimize the use of computational power throughout the communication process.
These strategies primarily involve overlapping matrix computations and
communications, as well as interleaving micro-batches across different
requests. Nonetheless, these approaches either fall short of ideal
overlap or restrict the settings in which they can be applied. To overcome these
challenges, this paper introduces a novel strategy for
computation-communication overlap that operates at the sequence level. This
method not only enhances the degree of overlap but also minimizes the
constraints on its applicability. Experimental evaluations conducted with
30B and 70B models demonstrate significant efficiency gains. Specifically,
the proposed technique reduces time consumption during the prefill stage of
LLM inference by approximately 35% on the 4090 GPU and by roughly 15% on the
A800 GPU.
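
The abstract does not spell out the paper's exact mechanism, but the general idea of sequence-level overlap under tensor parallelism can be sketched as follows: split the sequence dimension into chunks, and launch an asynchronous all-reduce for each chunk's partial matmul result so that communication for one chunk proceeds while the next chunk computes. In the PyTorch sketch below, the function name `tp_matmul_seq_overlap`, the chunk count, and the tensor shapes are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch only, not the authors' algorithm. Assumes a
# tensor-parallel setting where each rank computes a partial matmul
# result that must be summed across ranks via all-reduce.
import torch
import torch.distributed as dist

def tp_matmul_seq_overlap(x: torch.Tensor, w: torch.Tensor,
                          num_chunks: int = 4) -> torch.Tensor:
    """x: [seq_len, hidden] activations; w: this rank's weight shard.
    Overlaps the all-reduce of chunk i with the matmul of chunk i+1."""
    outputs, handles = [], []
    for chunk in x.chunk(num_chunks, dim=0):  # split along the sequence axis
        y = chunk @ w                          # partial result on this rank
        # Start the reduction without blocking; keep the work handle.
        handles.append(dist.all_reduce(y, async_op=True))
        outputs.append(y)
    for h in handles:                          # drain outstanding reductions
        h.wait()
    return torch.cat(outputs, dim=0)
```

With the NCCL backend, each `async_op=True` all-reduce is issued on a separate communication stream, so the reduction of one chunk can proceed while the following chunk's matmul runs on the compute stream; synchronization is deferred until all chunks have been launched.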