Toward Smart Scheduling in Tapis

arXiv - CS - Performance. Pub Date: 2024-08-05. DOI: arxiv-2408.03349

Joe Stubbs, Smruti Padhy, Richard Cardone



Abstract

The Tapis framework provides APIs for automating job execution on remote resources, including HPC clusters and servers running in the cloud. Tapis can simplify interaction with remote cyberinfrastructure (CI), but the current services require users to specify the exact configuration of a job to run, including the system, queue, node count, and maximum run time, among other attributes. Moreover, the remote resources must be defined and configured in Tapis before a job can be submitted. In this paper, we present our efforts to develop an intelligent job scheduling capability in Tapis, in which various attributes of a job configuration can be determined automatically for the user and computational resources can be dynamically provisioned by Tapis for specific jobs. We develop an overall architecture for such a feature, which suggests a set of core challenges to be solved. We then focus on one of these challenges, predicting queue times for a job on different HPC systems and queues, and present two sets of results based on machine learning methods. Our first set of results casts the problem as a regression, which can be used to select the best system from a list of existing options. Our second set of results frames the problem as a classification, allowing us to compare the use of an existing system with a dynamically provisioned resource.
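
The two framings in the abstract can be illustrated with a small Python sketch. It is not the authors' implementation and does not use the Tapis API: the `JobOption` fields, the system and queue names, the made-up wait times, and the 30-minute cutoff are all assumptions chosen for illustration, and the two stand-in functions take the place of trained models. The regression view picks the existing option with the smallest predicted queue time; the classification view asks only whether the wait is predicted to exceed a cutoff, which is enough to decide between an existing system and a dynamically provisioned one.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class JobOption:
    """One candidate configuration; field names are illustrative, not the Tapis schema."""
    system: str
    queue: str
    node_count: int
    max_minutes: int


def pick_shortest_wait(
    options: Sequence[JobOption],
    predict_wait_minutes: Callable[[JobOption], float],
) -> JobOption:
    """Regression view: choose the option with the smallest predicted queue time."""
    if not options:
        raise ValueError("at least one option is required")
    return min(options, key=predict_wait_minutes)


def choose_target(
    option: JobOption,
    wait_exceeds_cutoff: Callable[[JobOption], bool],
) -> str:
    """Classification view: provision a new resource only if a long wait is predicted."""
    return "provision" if wait_exceeds_cutoff(option) else "existing"


# Made-up predictions standing in for a trained regression model.
FAKE_WAITS = {("cluster-a", "normal"): 95.0, ("cluster-b", "small"): 12.5}


def fake_regressor(option: JobOption) -> float:
    """Hypothetical regressor: look up a fixed wait time in minutes."""
    return FAKE_WAITS[(option.system, option.queue)]


def fake_classifier(option: JobOption) -> bool:
    """Hypothetical classifier: True if the wait exceeds an assumed 30-minute cutoff."""
    return fake_regressor(option) > 30.0


options = [
    JobOption("cluster-a", "normal", node_count=4, max_minutes=120),
    JobOption("cluster-b", "small", node_count=4, max_minutes=120),
]
best = pick_shortest_wait(options, fake_regressor)
print(best.system, best.queue, choose_target(best, fake_classifier))
```

In this sketch the cutoff would plausibly correspond to the time needed to provision a new resource, but the abstract does not state how that threshold is chosen.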


