Evaluation of a Workflow Scheduler Using Integrated Performance Modelling and Batch Queue Wait Time Prediction

ACM/IEEE SC 2006 Conference (SC'06). Pub Date: 2006-11-11. DOI: 10.1145/1188455.1188579

Daniel Nurmi, A. Mandal, J. Brevik, C. Koelbel, R. Wolski, K. Kennedy


Citations: 70

Abstract

Large-scale distributed systems offer computational power at unprecedented levels. In the past, HPC users typically had access to relatively few individual supercomputers and, in general, would assign a one-to-one mapping of applications to machines. Modern HPC users have simultaneous access to a large number of individual machines and are beginning to make use of all of them for single-application execution cycles. One method that application developers have devised in order to take advantage of such systems is to organize an entire application execution cycle as a workflow. The scheduling of such workflows has been the topic of a great deal of research in the past few years and, although very sophisticated algorithms have been devised, a very specific aspect of these distributed systems, namely that most supercomputing resources employ batch queue scheduling software, has heretofore been omitted from consideration, presumably because it is difficult to model accurately. In this work, we augment an existing workflow scheduler through the introduction of methods which make accurate predictions of both the performance of the application on specific hardware and the amount of time individual workflow tasks would spend waiting in batch queues. Our results show that although a workflow scheduler alone may choose correct task placement based on data locality or network connectivity, this benefit is often compromised by the fact that most jobs submitted to current systems must wait in overcommitted batch queues for a significant portion of time. However, incorporating the enhancements we describe improves workflow execution time in settings where batch queues impose significant delays on constituent workflow tasks.
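
To make the idea in the abstract concrete, here is a minimal Python sketch of how a scheduler might combine a predicted batch queue wait with a predicted run time when placing a single task. It is an illustration only, not the scheduler or the prediction methods evaluated in the paper: the `Estimate` class, the `pick_resource` function, the resource names, and all numbers are hypothetical, and the predictions are assumed to be supplied by some external model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Estimate:
    """Hypothetical per-resource predictions for one task, in seconds."""
    resource: str
    queue_wait: float      # assumed predicted batch queue wait
    run_time: float        # assumed predicted execution time on this hardware
    transfer: float = 0.0  # assumed predicted input staging time

    @property
    def completion(self) -> float:
        # Predicted time from submission until the task finishes.
        return self.queue_wait + self.transfer + self.run_time


def pick_resource(estimates, use_queue_wait=True):
    """Return the estimate with the smallest predicted cost.

    With use_queue_wait=False the queue delay is ignored, which mimics a
    scheduler that considers only data transfer and execution time.
    """
    if not estimates:
        raise ValueError("no candidate resources")

    def cost(e):
        return e.completion if use_queue_wait else e.transfer + e.run_time

    return min(estimates, key=cost)


# Made-up numbers: cluster_a is faster and closer to the data,
# but its batch queue is heavily loaded.
candidates = [
    Estimate("cluster_a", queue_wait=7200.0, run_time=600.0, transfer=30.0),
    Estimate("cluster_b", queue_wait=300.0, run_time=900.0, transfer=120.0),
]
naive = pick_resource(candidates, use_queue_wait=False)
aware = pick_resource(candidates)
```

With these invented figures, the placement that ignores queue delay selects `cluster_a`, while the placement that includes the predicted wait selects `cluster_b`, which mirrors the abstract's point that a placement chosen on locality or connectivity alone can be undermined by time spent in an overcommitted queue.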
