How to Rent GPUs on a Budget
Zhouzi Li, Benjamin Berg, Arpan Mukhopadhyay, Mor Harchol-Balter
arXiv - CS - Performance, 2024-06-21. https://doi.org/arxiv-2406.15560
Citations: 0
Abstract
The explosion in Machine Learning (ML) over the past ten years has led to a
dramatic increase in demand for GPUs to train ML models. Because it is
prohibitively expensive for most users to build and maintain a large GPU
cluster, large cloud providers (Microsoft Azure, Amazon AWS, Google Cloud) have
seen explosive growth in demand for renting cloud-based GPUs. In this
cloud-computing paradigm, a user must specify their demand for GPUs at every
moment in time, and will pay for every GPU-hour they use. ML training jobs are
known to be parallelizable to different degrees. Given a stream of ML training
jobs, a user typically wants to minimize the mean response time across all
jobs. Here, the response time of a job denotes the time from when a job arrives
until it is complete. Additionally, the user is constrained by some operating
budget. Specifically, in this paper the user may use no more than $b$ GPUs
per hour on average, where the average is taken over the long run. The question is how to
minimize mean response time while meeting the budget constraint. Because
training jobs receive a diminishing marginal benefit from running on additional
GPUs, allocating too many GPUs to a single training job can dramatically
increase the overall cost paid by the user. Hence, an optimal rental policy
must balance a tradeoff between training cost and mean response time. This
paper derives the optimal rental policy for a stream of training jobs where the
jobs have different levels of parallelizability (specified by a speedup
function) and different job sizes (amounts of inherent work). We make almost no
assumptions about the arrival process and about the job size distribution. Our
optimal policy specifies how many GPUs to rent at every moment in time and how
to allocate these GPUs.
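
The tradeoff between training cost and response time can be illustrated with a minimal sketch. It assumes a hypothetical concave speedup function s(k) = k^p with 0 < p < 1, which is not taken from the abstract, and it considers a single job running in isolation, ignoring arrivals and queueing. The names `speedup`, `run_time`, and `gpu_hours` are illustrative only and are not the paper's policy.

```python
def speedup(k: int, p: float = 0.5) -> float:
    """Hypothetical concave speedup s(k) = k**p (an assumption, 0 < p < 1)."""
    return k ** p


def run_time(size: float, k: int, p: float = 0.5) -> float:
    """Hours needed to finish a job with `size` units of inherent work on k GPUs."""
    return size / speedup(k, p)


def gpu_hours(size: float, k: int, p: float = 0.5) -> float:
    """GPU-hours billed when k GPUs are rented for the job's whole run time."""
    return k * run_time(size, k, p)


if __name__ == "__main__":
    size = 8.0  # illustrative job size, in hours of work on one GPU
    for k in (1, 4, 16):
        print(f"k={k:2d}  run time={run_time(size, k):5.2f} h  "
              f"cost={gpu_hours(size, k):5.2f} GPU-hours")
```

Under this assumed speedup, each fourfold increase in GPUs halves the run time but doubles the GPU-hours paid, which is the kind of diminishing marginal benefit the abstract describes.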