Missile: Fine-Grained, Hardware-Level GPU Resource Isolation for Multi-Tenant DNN Inference

Yongkang Zhang, Haoxuan Yu, Chenxia Han, Cheng Wang, Baotong Lu, Yang Li, Xiaowen Chu, Huaicheng Li

arXiv - CS - Performance · arXiv:2407.13996 · 2024-07-19
Abstract
Colocating high-priority, latency-sensitive (LS) and low-priority, best-effort (BE) DNN inference services reduces the total cost of ownership (TCO) of GPU clusters. Limited by bottlenecks such as VRAM channel conflicts and PCIe bus contention, existing GPU sharing solutions cannot avoid resource conflicts among concurrently executing tasks, and thus fail to achieve both low latency for LS tasks and high throughput for BE tasks. To bridge this gap, this paper presents Missile, a general GPU sharing solution for multi-tenant DNN inference on NVIDIA GPUs. Missile approximates fine-grained GPU hardware resource isolation between multiple LS and BE DNN tasks at the software level. Through comprehensive reverse engineering, Missile first reveals the general VRAM channel hash-mapping architecture of NVIDIA GPUs and eliminates VRAM channel conflicts using software-level cache coloring. It also isolates the PCIe bus and fairly allocates PCIe bandwidth using a completely fair scheduler. We evaluate 12 mainstream DNNs with synthetic and real-world workloads on four GPUs. The results show that, compared to state-of-the-art GPU sharing solutions, Missile reduces tail latency for LS services by up to ~50%, achieves up to 6.1x higher BE job throughput, and allocates PCIe bus bandwidth to tenants on demand for optimal performance.
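The first mechanism the abstract names is software-level cache coloring built on the reverse-engineered VRAM channel hash mapping. As a rough illustration of that idea only, the sketch below carves a pre-allocated VRAM pool into fixed-size blocks and buckets each block by a toy XOR hash of its address bits. The hash, the bit positions, the `ColoredPool` helper, and the block size are all assumptions for this example, not the paper's mapping; a real implementation would also have to operate on physical page addresses rather than the virtual addresses cudaMalloc returns.

```cpp
#include <cuda_runtime.h>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy channel hash (assumption): channel = XOR-fold of two 4-bit address
// slices. Real GPUs use XOR hashes over physical address bits; the exact
// bits are what Missile reverse engineers per GPU.
static int channel_of(uintptr_t addr, int num_channels) {
    unsigned lo = (addr >> 8)  & 0xF;
    unsigned hi = (addr >> 16) & 0xF;
    return static_cast<int>((lo ^ hi) % num_channels);
}

// Hypothetical colored allocator: bucket each fixed-size block of a big
// VRAM pool by the channel its base address hashes to, so tenants can be
// served from disjoint channel sets.
struct ColoredPool {
    std::vector<std::vector<void*>> buckets;  // one free list per channel
    void* base = nullptr;

    bool init(size_t pool_bytes, size_t block_bytes, int num_channels) {
        if (cudaMalloc(&base, pool_bytes) != cudaSuccess) return false;
        buckets.assign(num_channels, {});
        for (size_t off = 0; off + block_bytes <= pool_bytes; off += block_bytes) {
            void* p = static_cast<char*>(base) + off;
            // Caveat: cudaMalloc yields device *virtual* addresses; the
            // real hash applies to physical addresses, so this only
            // illustrates the coloring idea.
            buckets[channel_of(reinterpret_cast<uintptr_t>(p), num_channels)]
                .push_back(p);
        }
        return true;
    }

    // Hand out a block that hashes to one of the tenant's channels.
    void* alloc_on_channel(int ch) {
        if (buckets[ch].empty()) return nullptr;
        void* p = buckets[ch].back();
        buckets[ch].pop_back();
        return p;
    }
};
```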
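The second mechanism is PCIe bandwidth allocation with a completely fair scheduler. The sketch below shows the general CFS idea applied to transfer scheduling, under assumed per-tenant weights and fixed-size chunking: each tenant accrues virtual service proportional to bytes moved divided by its weight, and the least-served tenant transfers the next chunk. The `Tenant` struct and `schedule_transfers` are illustrative names, not Missile's interface.

```cpp
#include <algorithm>
#include <cstdint>
#include <queue>
#include <vector>

// Per-tenant state for weighted fair sharing of the PCIe bus.
struct Tenant {
    int id;
    double weight;        // relative share of PCIe bandwidth
    double vservice = 0;  // bytes served, scaled by 1/weight
    uint64_t pending;     // bytes still waiting to transfer
};

struct MinVService {
    bool operator()(const Tenant* a, const Tenant* b) const {
        return a->vservice > b->vservice;  // min-heap on virtual service
    }
};

// CFS-style arbitration: always let the tenant with the least virtual
// service transfer the next fixed-size chunk, so long-run bandwidth
// converges to the weight ratio.
void schedule_transfers(std::vector<Tenant>& tenants, uint64_t chunk_bytes) {
    std::priority_queue<Tenant*, std::vector<Tenant*>, MinVService> rq;
    for (auto& t : tenants)
        if (t.pending > 0) rq.push(&t);
    while (!rq.empty()) {
        Tenant* t = rq.top();
        rq.pop();
        uint64_t n = std::min<uint64_t>(chunk_bytes, t->pending);
        // A real scheduler would issue, e.g., a cudaMemcpyAsync of n bytes
        // on this tenant's stream here; the sketch only does accounting.
        t->pending -= n;
        t->vservice += static_cast<double>(n) / t->weight;
        if (t->pending > 0) rq.push(t);
    }
}
```

Chunking is what makes this scheme preemptive at the bus level: a large BE copy cannot monopolize the link, because an LS tenant with lower virtual service regains the bus after at most one chunk.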