Evaluation of a Workflow Scheduler Using Integrated Performance Modelling and Batch Queue Wait Time Prediction

Daniel Nurmi, A. Mandal, J. Brevik, C. Koelbel, R. Wolski, K. Kennedy
{"title":"Evaluation of a Workflow Scheduler Using Integrated Performance Modelling and Batch Queue Wait Time Prediction","authors":"Daniel Nurmi, A. Mandal, J. Brevik, C. Koelbel, R. Wolski, K. Kennedy","doi":"10.1145/1188455.1188579","DOIUrl":null,"url":null,"abstract":"Large-scale distributed systems offer computational power at unprecedented levels. In the past, HPC users typically had access to relatively few individual supercomputers and, in general, would assign a one-to-one mapping of applications to machines. Modern HPC users have simultaneous access to a large number of individual machines and are beginning to make use of all of them for single-application execution cycles. One method that application developers have devised in order to take advantage of such systems is to organize an entire application execution cycle as a workflow. The scheduling of such workflows has been the topic of a great deal of research in the past few years and, although very sophisticated algorithms have been devised, a very specific aspect of these distributed systems, namely that most supercomputing resources employ batch queue scheduling software, has therefore been omitted from consideration, presumably because it is difficult to model accurately. In this work, we augment an existing workflow scheduler through the introduction of methods which make accurate predictions of both the performance of the application on specific hardware, and the amount of time individual workflow tasks would spend waiting in batch queues. Our results show that although a workflow scheduler alone may choose correct task placement based on data locality or network connectivity, this benefit is often compromised by the fact that most jobs submitted to current systems must wait in overcommitted batch queues for a significant portion of time. However, incorporating the enhancements we describe improves workflow execution time in settings where batch queues impose significant delays on constituent workflow tasks","PeriodicalId":333909,"journal":{"name":"ACM/IEEE SC 2006 Conference (SC'06)","volume":"195 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2006-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"70","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM/IEEE SC 2006 Conference (SC'06)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/1188455.1188579","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 70

Abstract

Large-scale distributed systems offer computational power at unprecedented levels. In the past, HPC users typically had access to relatively few individual supercomputers and, in general, would assign a one-to-one mapping of applications to machines. Modern HPC users have simultaneous access to a large number of individual machines and are beginning to make use of all of them for single-application execution cycles. One method that application developers have devised in order to take advantage of such systems is to organize an entire application execution cycle as a workflow. The scheduling of such workflows has been the topic of a great deal of research in the past few years and, although very sophisticated algorithms have been devised, a very specific aspect of these distributed systems, namely that most supercomputing resources employ batch queue scheduling software, has therefore been omitted from consideration, presumably because it is difficult to model accurately. In this work, we augment an existing workflow scheduler through the introduction of methods which make accurate predictions of both the performance of the application on specific hardware, and the amount of time individual workflow tasks would spend waiting in batch queues. Our results show that although a workflow scheduler alone may choose correct task placement based on data locality or network connectivity, this benefit is often compromised by the fact that most jobs submitted to current systems must wait in overcommitted batch queues for a significant portion of time. However, incorporating the enhancements we describe improves workflow execution time in settings where batch queues impose significant delays on constituent workflow tasks
基于集成性能建模和批处理队列等待时间预测的工作流调度程序评估
大规模分布式系统提供了前所未有的计算能力。在过去,HPC用户通常只能访问相对较少的单个超级计算机,并且通常会将应用程序的一对一映射分配给机器。现代HPC用户可以同时访问大量的独立机器,并开始在单个应用程序执行周期中使用所有这些机器。应用程序开发人员为了利用这种系统而设计的一种方法是将整个应用程序执行周期组织为工作流。这些工作流的调度在过去几年中一直是大量研究的主题,尽管已经设计了非常复杂的算法,但这些分布式系统的一个非常具体的方面,即大多数超级计算资源使用批队列调度软件,因此被忽略了,可能是因为难以准确建模。在这项工作中,我们通过引入一些方法来增强现有的工作流调度器,这些方法可以准确预测应用程序在特定硬件上的性能,以及单个工作流任务在批处理队列中等待的时间。我们的结果表明,尽管工作流调度器本身可以根据数据位置或网络连接选择正确的任务放置,但提交到当前系统的大多数作业必须在过度提交的批处理队列中等待相当长的时间,这一事实往往会损害这种好处。然而,在批处理队列对组成工作流任务造成严重延迟的情况下,结合我们描述的增强功能可以改善工作流执行时间
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:604180095
Book学术官方微信