An Infrastructure for Efficient Parallel Job Execution in Terascale Computing Environments

Proceedings of the IEEE/ACM SC98 Conference. Pub Date: 1998-11-07. DOI: 10.1109/SC.1998.10026

J. Moreira, W. Chan, L. Fong, H. Franke, M. Jette


Citations: 33

Abstract

Recent Terascale computing environments, such as those in the Department of Energy Accelerated Strategic Computing Initiative, present a new challenge to job scheduling and execution systems. The traditional way to concurrently execute multiple jobs on such large machines is through space-sharing: each job is given dedicated use of a pool of processors. Previous work in this area has demonstrated the benefits of sharing the parallel machine's resources not only spatially but also temporally. Time-sharing creates virtual processors for the execution of jobs. The scheduling is typically performed cyclically, and each time-slice of the cycle can be considered an independent virtual machine. When all tasks of a parallel job are scheduled to run in the same time-slice (the same virtual machine), gang-scheduling is accomplished. Research has shown that gang-scheduling can greatly improve system utilization and job response time in large parallel systems. We are developing GangLL, a research prototype system for performing gang-scheduling on the ASCI Blue-Pacific machine, an IBM RS/6000 SP to be installed at Lawrence Livermore National Laboratory. This machine consists of several hundred nodes, interconnected by a high-speed communication switch. GangLL is organized as a centralized scheduler that performs global decision-making, and a local daemon in each node that controls job execution according to those decisions. The centralized scheduler builds an Ousterhout matrix that precisely defines the temporal and spatial allocation of tasks in the system. Once the matrix is built, it is distributed to each of the local daemons using a scalable hierarchical distribution scheme. A two-phase commit is used in the distribution scheme to guarantee that all local daemons have consistent information. The local daemons enforce the schedule dictated by the Ousterhout matrix in their corresponding nodes. This requires suspending and resuming execution of tasks and multiplexing access to the communication switch. Large supercomputing centers tend to have their own job scheduling systems to handle site-specific conditions. Therefore, we are designing GangLL so that it can interact with an external site scheduler. The goal is to let the site scheduler control spatial allocation of jobs, if so desired, and to decide when jobs run. GangLL then performs the detailed temporal allocation and controls the actual execution of jobs. The site scheduler can control the fraction of a shared processor that a job receives through an execution factor parameter. To quantify the benefits of our gang-scheduling system to job execution in a large parallel system, we simulate the system with a realistic workload. We measure performance parameters under various degrees of time-sharing, characterized by the multiprogramming level. Our results show that higher multiprogramming levels lead to higher system utilization and lower job response times. We also report some results from the initial deployment of GangLL on a small multiprocessor system.
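To make the Ousterhout matrix concrete, the following Python sketch packs jobs into a matrix whose rows are time-slices and whose columns are processors. It is only an illustration under stated assumptions: the abstract does not say how GangLL builds its matrix, so the first-fit packing policy, the function names (`build_matrix`, `local_schedule`, `utilization`), and the `max_mpl` cap on the multiprogramming level are all hypothetical.

```python
from typing import Dict, List, Optional

# Rows are time-slices, columns are processors; a cell holds a job name or None.
Matrix = List[List[Optional[str]]]


def build_matrix(jobs: Dict[str, int], num_procs: int, max_mpl: int) -> Matrix:
    """First-fit packing of jobs into an Ousterhout matrix (assumed policy).

    `jobs` maps a job name to its task count, one processor per task.
    All tasks of a job are placed in the same row, which is the
    gang-scheduling property. Jobs that fit in no row are left out
    (in a real system they would stay queued).
    """
    matrix: Matrix = []
    for name, size in jobs.items():
        if size > num_procs:
            continue  # can never fit on this machine
        placed = False
        for row in matrix:
            free = [p for p, owner in enumerate(row) if owner is None]
            if len(free) >= size:
                for p in free[:size]:
                    row[p] = name
                placed = True
                break
        # Open a new time-slice only while below the multiprogramming cap.
        if not placed and len(matrix) < max_mpl:
            new_row: List[Optional[str]] = [None] * num_procs
            for p in range(size):
                new_row[p] = name
            matrix.append(new_row)
    return matrix


def local_schedule(matrix: Matrix, proc: int) -> List[Optional[str]]:
    """Column of the matrix that one node-local daemon would cycle through."""
    return [row[proc] for row in matrix]


def utilization(matrix: Matrix) -> float:
    """Fraction of (time-slice, processor) cells that hold a task."""
    cells = sum(len(row) for row in matrix)
    busy = sum(owner is not None for row in matrix for owner in row)
    return busy / cells if cells else 0.0


if __name__ == "__main__":
    demo = build_matrix({"A": 4, "B": 2, "C": 3, "D": 2}, num_procs=4, max_mpl=3)
    for time_slice, row in enumerate(demo):
        print(time_slice, row)
    print("processor 3 runs:", local_schedule(demo, 3))
    print("utilization:", utilization(demo))
```

With four processors and at most three time-slices, job D fills the two cells left free next to job B, so 11 of the 12 cells are occupied. Each column is what a single local daemon would enforce on its node, suspending and resuming tasks at time-slice boundaries.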

