Title: Toward Smart Scheduling in Tapis
Authors: Joe Stubbs, Smruti Padhy, Richard Cardone
Venue: arXiv - CS - Performance
Published: 2024-08-05
DOI: arxiv-2408.03349 (https://doi.org/arxiv-2408.03349)
Citations: 0
Abstract
The Tapis framework provides APIs for automating job execution on remote
resources, including HPC clusters and servers running in the cloud. Tapis can
simplify the interaction with remote cyberinfrastructure (CI), but the current
services require users to specify the exact configuration of a job to run,
including the system, queue, node count, and maximum run time, among other
attributes. Moreover, the remote resources must be defined and configured in
Tapis before a job can be submitted. In this paper, we present our efforts to
develop an intelligent job scheduling capability in Tapis, in which
attributes of a job configuration can be determined automatically for the
user, and computational resources can be dynamically provisioned by Tapis for
specific jobs. We develop an overall architecture for such a feature, which
suggests a set of core challenges to be solved. Then, we focus on one such
specific challenge: predicting queue times for a job on different HPC systems
and queues, and we present two sets of results based on machine learning
methods. Our first set of results casts the problem as a regression, which can
be used to select the best system from a list of existing options. Our second
set of results frames the problem as a classification, allowing us to compare
the use of an existing system with a dynamically provisioned resource.
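
The two framings described above can be illustrated with a minimal Python sketch. It is not the authors' implementation and does not use the Tapis API: the function names, the job fields, and the idea of comparing against a "long wait" label are assumptions made for illustration, and the predictors in the usage example are toy stand-ins rather than trained models.

```python
from typing import Callable, Dict, Mapping, Tuple

# Hypothetical job description; the field names are illustrative, not Tapis fields.
Job = Mapping[str, float]
# A target is a (system, queue) pair.
Target = Tuple[str, str]
# Regression view: a model that maps a job to a predicted queue time in minutes.
QueueTimeRegressor = Callable[[Job], float]
# Classification view: a model that labels a job's expected wait as long (True) or not.
LongWaitClassifier = Callable[[Job], bool]


def select_best_system(
    job: Job, regressors: Dict[Target, QueueTimeRegressor]
) -> Tuple[Target, float]:
    """Return the existing (system, queue) with the lowest predicted queue time."""
    if not regressors:
        raise ValueError("at least one candidate system is required")
    predictions = {target: predict(job) for target, predict in regressors.items()}
    best = min(predictions, key=lambda target: predictions[target])
    return best, predictions[best]


def choose_existing_or_provisioned(job: Job, is_long_wait: LongWaitClassifier) -> str:
    """Prefer a dynamically provisioned resource when a long wait is predicted.

    Assumption: the "long wait" class means the queue time on the existing
    system is expected to exceed the time needed to provision a new resource.
    """
    return "provisioned" if is_long_wait(job) else "existing"


if __name__ == "__main__":
    # Toy stand-ins for trained models; the coefficients are made up.
    regressors = {
        ("system-a", "normal"): lambda job: 5.0 * job["node_count"],
        ("system-b", "normal"): lambda job: 30.0 + 1.0 * job["node_count"],
    }
    job = {"node_count": 4, "max_minutes": 60}
    print(select_best_system(job, regressors))
    print(choose_existing_or_provisioned(job, lambda j: j["node_count"] > 8))
```

The regression view ranks existing options against each other, so it needs one numeric prediction per candidate. The classification view needs only a yes/no answer, which is enough to decide between one existing system and a provisioned alternative.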