Real-Time Scheduling Policy Selection from Queue and Machine States
Luis Sant'Ana, Danilo Carastan-Santos, Daniel Cordeiro, R. Camargo
2019 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID), published 2019-05-14
DOI: 10.1109/CCGRID.2019.00052 (https://doi.org/10.1109/CCGRID.2019.00052)
Citations: 6
Abstract
Task scheduling in large-scale HPC platforms is normally accomplished with simple heuristics combined with a backfilling algorithm. Some strategies, such as First-Come-First-Serve (FCFS) with backfilling, provide reasonable results in a variety of scenarios, including different HPC platforms and task set characteristics. For each scenario, however, a different strategy might be the most appropriate for minimizing a given metric, such as the average task waiting time or turnaround time. In this work, we present a real-time scheduling policy selection algorithm that takes as input the characteristics of the jobs in the running queue and the machine states. We evaluated logistic regression and support-vector machines for mapping the queue and machine state to the selected scheduling policy. The machine learning algorithms are trained and evaluated using simulations configured from HPC platform traces. When selecting among eight scheduling policies, we obtained an accuracy above 80% relative to the best selection. When simulating the online real-time selection of policies over a period of one year, we obtained a reduction in the mean queue waiting time of tasks of up to 40% compared with FCFS and 10% compared with randomly selecting policies. Moreover, the method performed close to the best possible selection of policies, with at most a 9% increase in the mean queue waiting time.
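The notion of a "best selection" among policies, and of a reduction in mean queue waiting time relative to FCFS, can be illustrated with a minimal sketch. This is not the authors' simulator. It assumes a single machine that runs one job at a time with no backfilling, and the names `Job`, `POLICIES`, `mean_wait`, `best_policy`, and `reduction_pct` are hypothetical. Only FCFS is named in the abstract; the other two policies are illustrative stand-ins, not the eight policies evaluated in the paper.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Job:
    submit: float   # submission time, in seconds
    runtime: float  # execution time, in seconds


# A policy is modelled as a priority key: the ready job with the smallest key runs next.
Policy = Callable[[Job], float]

POLICIES: Dict[str, Policy] = {
    "FCFS": lambda j: j.submit,    # first come, first served (named in the abstract)
    "SPF": lambda j: j.runtime,    # shortest job first (illustrative assumption)
    "LPF": lambda j: -j.runtime,   # longest job first (illustrative assumption)
}


def mean_wait(jobs: List[Job], policy: Policy) -> float:
    """Mean queue waiting time on one machine that runs one job at a time."""
    if not jobs:
        return 0.0
    pending = sorted(jobs, key=lambda j: j.submit)
    now, total = 0.0, 0.0
    while pending:
        ready = [j for j in pending if j.submit <= now]
        if not ready:
            now = pending[0].submit  # machine idles until the next submission
            continue
        job = min(ready, key=policy)
        pending.remove(job)
        total += now - job.submit    # time the job spent waiting in the queue
        now += job.runtime
    return total / len(jobs)


def best_policy(jobs: List[Job]) -> Tuple[str, Dict[str, float]]:
    """Return the policy with the lowest mean wait, plus all mean waits."""
    waits = {name: mean_wait(jobs, p) for name, p in POLICIES.items()}
    return min(waits, key=waits.get), waits


def reduction_pct(baseline: float, value: float) -> float:
    """Percentage reduction of `value` relative to a non-zero `baseline`."""
    return 100.0 * (baseline - value) / baseline


jobs = [Job(0.0, 10.0), Job(1.0, 100.0), Job(2.0, 20.0)]
best, waits = best_policy(jobs)
print(best, waits, reduction_pct(waits["FCFS"], waits[best]))
```

In a setup like the one the abstract describes, a label such as `best` could serve as the training target for a classifier whose inputs are features of the queue and machine state. The features and classifier configuration are not specified in the abstract, so they are left out here.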