I/O threads to reduce checkpoint blocking for an electromagnetics solver on Blue Gene/P and Cray XK6

International Workshop on Runtime and Operating Systems for Supercomputers. Pub Date: 2012-06-29. DOI: 10.1145/2318916.2318919

Jing Fu, R. Latham, M. Min, C. Carothers


Citations: 7

Abstract

Application-level checkpointing is one of the most popular techniques for proactively dealing with unexpected failures on supercomputers with hundreds of thousands of cores. Unfortunately, it generates a heavy I/O load and often causes I/O bottlenecks in production runs. In this paper, we examine a new thread-based, application-level checkpointing approach for NekCEM, a massively parallel electromagnetic solver system, on the IBM Blue Gene/P at Argonne National Laboratory and the Cray XK6 at Oak Ridge National Laboratory. We discuss an I/O-thread-based, application-level, two-phase I/O approach called "threaded reduced-blocking I/O" (threaded rbIO) and compare it with the regular version of "reduced-blocking I/O" (rbIO) and a tuned MPI-IO collective approach (coIO). Our study shows that threaded rbIO can overlap I/O latency with computation and achieve near-asynchronous checkpointing, with an application-perceived I/O bandwidth of over 70 GB/s (raw I/O bandwidth of 15 GB/s) on Intrepid (Blue Gene/P) and over 50 GB/s (raw I/O bandwidth of 17 GB/s) on Jaguar (Cray XK6), on up to 32K processors. Compared with rbIO and coIO, the threading approach greatly improves the production performance of NekCEM on both machines: reducing the time the solver spends blocked on checkpoints significantly shortens the total simulation time. We also discuss the potential of this approach in combination with the Scalable Checkpoint Restart library and on other full-featured operating systems, such as the one to be deployed on the upcoming Blue Gene/Q.
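To illustrate the idea of overlapping checkpoint I/O with computation, the following is a minimal single-process sketch in Python. It is not the NekCEM or rbIO implementation, which runs under MPI with dedicated writer processes; the class name `ThreadedCheckpointWriter`, the function `run`, the file naming scheme, and the buffer sizes are all assumptions made for illustration. The solver loop only pays for copying its state into a queue, while a background I/O thread performs the actual file writes.

```python
import os
import queue
import tempfile
import threading
import time


class ThreadedCheckpointWriter:
    """Hypothetical helper: hands checkpoint buffers to a background I/O thread."""

    def __init__(self, directory):
        self.directory = directory
        self._queue = queue.Queue()
        self._thread = threading.Thread(target=self._drain, daemon=True)
        self._thread.start()

    def _drain(self):
        # Runs on the I/O thread: write each queued buffer to its own file.
        while True:
            item = self._queue.get()
            if item is None:  # sentinel: no more checkpoints will arrive
                break
            step, payload = item
            path = os.path.join(self.directory, f"ckpt_{step:06d}.bin")
            with open(path, "wb") as f:
                f.write(payload)
                f.flush()
                os.fsync(f.fileno())

    def submit(self, step, state):
        """Snapshot the state and return; only the copy blocks the caller."""
        start = time.perf_counter()
        self._queue.put((step, bytes(state)))  # bytes() copies the buffer
        return time.perf_counter() - start

    def close(self):
        # Signal the I/O thread to stop and wait for pending writes to finish.
        self._queue.put(None)
        self._thread.join()


def run(num_steps=4, checkpoint_every=2, size=1 << 20):
    """Toy solver loop (assumed for illustration) that checkpoints periodically."""
    state = bytearray(size)
    with tempfile.TemporaryDirectory() as tmp:
        writer = ThreadedCheckpointWriter(tmp)
        blocked = 0.0
        for step in range(1, num_steps + 1):
            state[0] = step % 256  # stand-in for one solver time step
            if step % checkpoint_every == 0:
                blocked += writer.submit(step, state)
        writer.close()
        files = sorted(os.listdir(tmp))
        sizes = [os.path.getsize(os.path.join(tmp, name)) for name in files]
    return files, sizes, blocked


if __name__ == "__main__":
    files, sizes, blocked = run()
    print(files, sizes, f"solver blocked for {blocked:.6f} s")
```

In this sketch the time returned by `submit` plays the role of the blocking the application perceives, while the time the I/O thread spends in `write` and `fsync` corresponds to the raw I/O cost. The gap between the two is what an overlapping scheme aims to hide.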

