Toward Smart Scheduling in Tapis

Joe Stubbs, Smruti Padhy, Richard Cardone
arXiv - CS - Performance, published 2024-08-05
DOI: arxiv-2408.03349 (https://doi.org/arxiv-2408.03349)
Citations: 0

Abstract

The Tapis framework provides APIs for automating job execution on remote resources, including HPC clusters and servers running in the cloud. Tapis can simplify the interaction with remote cyberinfrastructure (CI), but the current services require users to specify the exact configuration of a job to run, including the system, queue, node count, and maximum run time, among other attributes. Moreover, the remote resources must be defined and configured in Tapis before a job can be submitted. In this paper, we present our efforts to develop an intelligent job scheduling capability in Tapis, where various attributes of a job configuration can be automatically determined for the user, and computational resources can be dynamically provisioned by Tapis for specific jobs. We develop an overall architecture for such a feature, which suggests a set of core challenges to be solved. Then, we focus on one such specific challenge: predicting queue times for a job on different HPC systems and queues, and we present two sets of results based on machine learning methods. Our first set of results casts the problem as a regression, which can be used to select the best system from a list of existing options. Our second set of results frames the problem as a classification, allowing us to compare the use of an existing system with a dynamically provisioned resource.
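
The two framings in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the features, models, or class boundaries, so the `Candidate` type, the toy queue-time table, the system names, and the 30-minute provisioning threshold are all hypothetical stand-ins. The regression view ranks existing system and queue options by predicted wait; the classification view asks only whether the predicted wait exceeds the time assumed necessary to provision a resource dynamically.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Candidate:
    """Hypothetical job placement: the attributes a user must otherwise pick by hand."""
    system: str
    queue: str
    node_count: int
    max_minutes: int


def select_by_regression(
    candidates: Sequence[Candidate],
    predict_queue_minutes: Callable[[Candidate], float],
) -> Candidate:
    """Regression view: choose the option with the smallest predicted queue time."""
    return min(candidates, key=predict_queue_minutes)


# Toy stand-in for a trained regressor (assumed values, not measured data).
TOY_ESTIMATES = {
    ("system-a", "normal"): 95.0,
    ("system-b", "normal"): 20.0,
    ("system-b", "large"): 240.0,
}


def toy_regressor(c: Candidate) -> float:
    """Return the assumed queue time in minutes for a candidate."""
    return TOY_ESTIMATES[(c.system, c.queue)]


def toy_classifier(c: Candidate, provisioning_minutes: float = 30.0) -> bool:
    """Classification view: True if the wait is predicted to exceed the
    (assumed) time needed to provision a resource dynamically."""
    return toy_regressor(c) > provisioning_minutes


candidates = [
    Candidate("system-a", "normal", node_count=2, max_minutes=60),
    Candidate("system-b", "normal", node_count=2, max_minutes=60),
    Candidate("system-b", "large", node_count=2, max_minutes=60),
]

best = select_by_regression(candidates, toy_regressor)
use_dynamic = toy_classifier(best)
print(best.system, best.queue, use_dynamic)
```

In this sketch the classifier is derived from the regressor only to keep the example short; a separately trained classifier could be substituted for `toy_classifier` without changing the selection logic.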