Scalable, fault-tolerant job step management for high-performance systems

IF 1.3 | Zone 4, Computer Science | JCR Q1, Computer Science

IBM Journal of Research and Development Pub Date : 2019-12-10 DOI:10.1147/JRD.2019.2958909

D. Solt;J. Hursey;A. Lauria;D. Guo;X. Guo

IBM Journal of Research and Development, vol. 64, no. 3/4, pp. 8:1-8:9, 2019. Full text: https://ieeexplore.ieee.org/document/8930300/

Citations: 1

Abstract

Scientific applications on the CORAL systems demanded a fault-tolerant, scalable job launch infrastructure for complex workflows with multiple job steps within an allocation. The distinct design of IBM's Job Step Manager (JSM) infrastructure, working in concert with Load Sharing Facility (LSF) and Cluster System Management (CSM), achieves these goals. JSM demonstrated launching over three-quarters of a million processes in under a minute while providing efficient Process Management Interface for Exascale (PMIx) services to communication libraries, such as the Parallel Active Messaging Interface (PAMI) and the Message Passing Interface (MPI), and to tools, over the management network. JSM relies on the parallel task support library to provide a fault-tolerant, scalable communication medium between the JSM daemons. Application workflows using job steps harness the unique resource set abstraction in JSM to manage CPUs, GPUs, and memory between groups of processes that share a node, possibly in discrete job steps. The resource set concept gives JSM the opportunity to better organize process placement to optimize, for example, CPU-to-GPU communication. Applications that need complete control over the shaping of the resource sets and the placement, binding, and ordering of processes within them can leverage JSM's co-designed Explicit Resource File mechanism. This article explores the design decisions, implementation considerations, and performance optimizations of IBM's JSM infrastructure to support scientific discovery on the CORAL systems.
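The resource set idea described in the abstract can be illustrated with a small sketch. This is not JSM's implementation or API: the `ResourceSet` class, the `partition_node` function, the node sizes, and the choice to group contiguous CPU and GPU IDs are all assumptions made purely for illustration. It shows only the general notion of carving a node's CPUs and GPUs into disjoint groups that separate process groups could own.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceSet:
    """Hypothetical model of one resource set: a slice of a node's CPUs and GPUs."""
    cpus: tuple
    gpus: tuple


def partition_node(num_cpus, num_gpus, num_sets):
    """Split a node's CPUs and GPUs evenly into disjoint resource sets.

    Assumption: resources are identified by consecutive integer IDs and
    contiguous IDs are grouped together. Real placement policies may differ.
    """
    if num_sets <= 0:
        raise ValueError("num_sets must be positive")
    if num_cpus % num_sets or num_gpus % num_sets:
        raise ValueError("CPUs and GPUs must divide evenly across resource sets")
    cpus_per_set = num_cpus // num_sets
    gpus_per_set = num_gpus // num_sets
    sets = []
    for i in range(num_sets):
        cpus = tuple(range(i * cpus_per_set, (i + 1) * cpus_per_set))
        gpus = tuple(range(i * gpus_per_set, (i + 1) * gpus_per_set))
        sets.append(ResourceSet(cpus=cpus, gpus=gpus))
    return sets


if __name__ == "__main__":
    # Illustrative node size only: 8 CPUs and 4 GPUs split into 4 resource sets.
    for index, rs in enumerate(partition_node(num_cpus=8, num_gpus=4, num_sets=4)):
        print(index, rs.cpus, rs.gpus)
```

Because the sets are disjoint, two job steps sharing a node could each be handed different sets without contending for the same cores or GPUs, which is the property the abstract attributes to the abstraction.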


Source journal: IBM Journal of Research and Development (Engineering and Technology: Computer Hardware)

Self-citation rate: 0.00%

Review time: 6-12 weeks

About the journal: The IBM Journal of Research and Development is a peer-reviewed technical journal, published bimonthly, which features the work of authors in the science, technology and engineering of information systems. Papers are written for the worldwide scientific research and development community and knowledgeable professionals. Submitted papers are welcome from the IBM technical community and from non-IBM authors on topics relevant to the scientific and technical content of the Journal.