Pedro Mendes, Maria Casimiro, P. Romano, D. Garlan
{"title":"TrimTuner: Efficient Optimization of Machine Learning Jobs in the Cloud via Sub-Sampling","authors":"Pedro Mendes, Maria Casimiro, P. Romano, D. Garlan","doi":"10.1109/MASCOTS50786.2020.9285971","DOIUrl":"https://doi.org/10.1109/MASCOTS50786.2020.9285971","url":null,"abstract":"This work introduces TrimTuner, the first system for optimizing machine learning jobs in the cloud that exploits sub-sampling techniques to reduce the cost of the optimization process, while taking into account user-specified constraints. TrimTuner jointly optimizes the cloud and application-specific parameters and, unlike state-of-the-art works for cloud optimization, eschews the need to train the model with the full training set every time a new configuration is sampled. Indeed, by leveraging sub-sampling techniques and datasets that are up to 60x smaller than the original one, we show that TrimTuner can reduce the cost of the optimization process by up to 50x. Further, TrimTuner speeds up the recommendation process by 65x with respect to state-of-the-art techniques for hyperparameter optimization that use sub-sampling techniques. 
The reasons for this improvement are twofold: i) a novel domain-specific heuristic that reduces the number of configurations for which the acquisition function has to be evaluated; ii) the adoption of an ensemble of decision trees that enables boosting the speed of the recommendation process by one additional order of magnitude.","PeriodicalId":272614,"journal":{"name":"2020 28th International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-11-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134598993","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"COCOA: Cold Start Aware Capacity Planning for Function-as-a-Service Platforms","authors":"Alim Ul Gias, G. Casale","doi":"10.1109/MASCOTS50786.2020.9285966","DOIUrl":"https://doi.org/10.1109/MASCOTS50786.2020.9285966","url":null,"abstract":"Function-as-a-Service (FaaS) has become increasingly popular in the software industry due to the implied cost-savings in event-driven workloads and its synergy with DevOps. To size an on-premises FaaS platform, it is important to estimate the required CPU and memory capacity to serve the expected loads. Given the service-level agreements, it is however challenging to take the cold start issue into account during the sizing process. We have investigated the similarity of this problem with the hit rate improvement problem in Time to Live (TTL) caches and concluded that solutions for TTL caches, although potentially applicable, lead to over-provisioning in FaaS. Thus, we propose a novel approach, COCOA, to solve this issue. COCOA uses a queueing-based approach to assess the effect of cold starts on FaaS response times. It also considers different memory consumption values depending on whether the function is idle or in execution. 
Using an event-driven FaaS simulator, FaasSim, that we have developed, we show that COCOA can reduce over-provisioning by over 70% under some of the workloads we have considered, while satisfying the service-level agreements.","PeriodicalId":272614,"journal":{"name":"2020 28th International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS)","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128713224","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Vaibhav Saxena, K. R. Jayaram, Saurav Basu, Yogish Sabharwal, Ashish Verma
{"title":"Effective Elastic Scaling of Deep Learning Workloads","authors":"Vaibhav Saxena, K. R. Jayaram, Saurav Basu, Yogish Sabharwal, Ashish Verma","doi":"10.1109/MASCOTS50786.2020.9285954","DOIUrl":"https://doi.org/10.1109/MASCOTS50786.2020.9285954","url":null,"abstract":"We examine the elastic scaling of Deep Learning (DL) jobs and propose a novel resource allocation strategy for DL training jobs, resulting in improved job run time performance as well as increased cluster utilization. We begin by analyzing DL workloads and exploit the fact that DL jobs can be run with a range of batch sizes without affecting their final accuracy. We formulate an optimization problem that explores a dynamic batch size allocation to individual DL jobs based on their scaling efficiency, when running on multiple nodes. We design a fast dynamic programming based optimizer to solve this problem in real-time to determine jobs that can be scaled up/down, and use this optimizer in an autoscaler to dynamically change the allocated resources and batch sizes of individual DL jobs. 
We demonstrate empirically that our elastic scaling algorithm can complete up to as many jobs as compared to a strong baseline algorithm that also scales the number of GPUs but does not change the batch size, with average completion times up to faster.","PeriodicalId":272614,"journal":{"name":"2020 28th International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS)","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-06-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122184131","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Smart Background Scheduler for Storage Systems","authors":"Maher Kachmar, D. Kaeli","doi":"10.1109/MASCOTS50786.2020.9285967","DOIUrl":"https://doi.org/10.1109/MASCOTS50786.2020.9285967","url":null,"abstract":"In today's enterprise storage systems, supported data services such as snapshot delete or drive rebuild can result in tremendous performance overhead if executed inline along with heavy foreground IO, often leading to missed Service Level Objectives (SLOs). Typical storage system applications such as Virtual Desktop Infrastructure (VDI) or web services follow a repetitive high/low workload pattern that can be learned and forecasted. We propose a priority-based background scheduler that learns this pattern and allows storage systems to maintain peak performance and meet SLOs while supporting a number of data services. When foreground IO demand intensifies, system resources are dedicated to servicing foreground IO requests, and any background processing that can be deferred is recorded to be processed in future idle cycles, as long as our forecaster predicts that the storage pool has remaining capacity. The smart background scheduler adopts a resource partitioning model that allows both foreground and background IO to execute together as long as foreground IOs are not impacted, harnessing any free cycles to clear background debt. 
Using traces from VDI and web services applications, we show how our technique can outperform a static policy that sets fixed limits on the deferred background debt and reduce SLO violations from 54.6% (when using a fixed background debt watermark) to only 6.2% when dynamically adjusted by our smart background scheduler.","PeriodicalId":272614,"journal":{"name":"2020 28th International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS)","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-06-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128368277","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Age of Information in an Overtake-Free Network of Quasi-Reversible Queues","authors":"I. Koukoutsidis","doi":"10.1109/MASCOTS50786.2020.9285958","DOIUrl":"https://doi.org/10.1109/MASCOTS50786.2020.9285958","url":null,"abstract":"We show how to calculate the Age of Information in an overtake-free network of quasi-reversible queues, with exponential exogenous interarrivals of multiple classes of update packets and exponential service times at all nodes. Results are provided for any number of M/M/1 First-Come-First-Served (FCFS) queues in tandem, and for a network with two classes of update packets, entering through different queues in the network and exiting through the same queue. The main takeaway is that in a network with different classes of update packets, individual classes roughly preserve the ages they would achieve if they were alone in the network, except when shared queues become saturated, in which case the ages increase considerably. The results are extensible to other quasi-reversible queues for which sojourn time distributions are known, such as M/M/c FCFS queues and processor-sharing queues.","PeriodicalId":272614,"journal":{"name":"2020 28th International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS)","volume":"9 11","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-05-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131436457","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}