A design of hybrid operating system for a parallel computer with multi-core and many-core processors
Mikiko Sato, Go Fukazawa, Kiyohiko Nagamine, Ryuichi Sakamoto, M. Namiki, Kazumi Yoshinaga, Y. Tsujita, A. Hori, Y. Ishikawa
International Workshop on Runtime and Operating Systems for Supercomputers, 2012. DOI: 10.1145/2318916.2318927

Abstract: This paper describes the design of an operating system for managing a hybrid computer architecture that combines multi-core and many-core processors for exascale computing. In this design, a host operating system (Host OS) on the multi-core processor performs some functions of a lightweight operating system (LWOS) on the many-core processor, so that the many-core processor can be dedicated to executing the application program. In particular, to ensure that LWOS execution does not disturb the application running on the many-core processor, functions such as process management, memory management, and I/O management are delegated to the Host OS. To demonstrate this design, we built a prototype of a computer equipped with a multi-core and a many-core processor using an Intel Xeon dual-core system. Linux and our original LWOS were loaded onto the respective processors, and the overhead of launching a program on the LWOS from Linux was evaluated. On this prototype, an LWOS process for the many-core program can be started with an overhead of at least 110 μs.
{"title":"Supercomputing operating systems: a naive view from over the fence","authors":"Timothy Roscoe","doi":"10.1145/2318916.2318917","DOIUrl":"https://doi.org/10.1145/2318916.2318917","url":null,"abstract":"To exaggerate unfairly, from the perspective of mainstream OS research, the supercomputing community has a very different idea of the role (and appropriate design) of an OS. HPC people regard the OS as an annoying source of noise, whereas the former crowd see it as a thing of wondrous beauty and elegance, a sine qua non of usable everyday computing.\u0000 This situation has existed without serious conflict erupting for years: OS researchers worried about PCs with one core (or, at most, a handful of cores) running a general-purpose OS and supporting a dynamic, bursty, diverse mix of hundreds of interactive, long-running, soft-realtime and/or background processes. Supercomputing people wanted one, highly parallel, program to finish as soon as possible so they could get on to the next one.\u0000 With multicore, this all changed: highly parallel tasks will be the norm for future general-purpose computing. In 2007, my colleagues and I eagerly embarked on a new research OS for multicore computing, and looked forward to applying long-ignored (in our field) results from the HPC realm to our system.\u0000 It didn't quite work out that way. In this talk I will look at what we found to be common to the two fields, and what we didn't, and speculate on where this might be going. I think there is a useful conversation to be had, and I'd like to help revive it.","PeriodicalId":335825,"journal":{"name":"International Workshop on Runtime and Operating Systems for Supercomputers","volume":"61 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-06-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114644143","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Node-based memory management for scalable NUMA architectures","authors":"Stefan Lankes, T. Bemmerl, T. Roehl, C. Terboven","doi":"10.1145/2318916.2318929","DOIUrl":"https://doi.org/10.1145/2318916.2318929","url":null,"abstract":"Large state-of-the-art NUMA systems may offer more than two levels of node distances. The result is a hierarchical architecture with significant differences in memory access bandwidth and latency. Consequently, NUMA-aware memory management and the reduction of remote memory accesses becomes more and more the key challenge for the operating system and its applications. In this paper, we will show that traditional, centralized concepts to realize paging are not longer an adequate approach for these architectures. We present a prototype of new node-based memory management for the Linux kernel and prove its scalability and usability. The hardware architecture is reflected by managing one page mapping table per NUMA node and the kernel's page fault handler is extended to create node-local references. Based on this prototype, we suggest extensions to simplify the detection of performance issues, which will increase the usability of such architectures.","PeriodicalId":335825,"journal":{"name":"International Workshop on Runtime and Operating Systems for Supercomputers","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-06-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116596423","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Integrated in-system storage architecture for high performance computing
D. Kimpe, K. Mohror, A. Moody, B. V. Essen, M. Gokhale, R. Ross, B. Supinski
International Workshop on Runtime and Operating Systems for Supercomputers, 2012. DOI: 10.1145/2318916.2318921

Abstract: In-system solid state storage is expected to be an important component of the I/O subsystem on the first exascale platforms, as it has the potential to reduce DRAM requirements, to increase system reliability, and to smooth I/O loads.

This paper describes the design of a prototype integrated in-system storage architecture that we are developing to serve the diverse needs of high performance computing. Our container abstraction will provide lightweight management of in-system storage devices, as well as methods to access containers remotely and to transfer them within the storage hierarchy. We are also working on a storage hierarchy abstraction API to provide portable HPC I/O software with critical information on the configuration of the system on which it is running. As currently available large-scale HPC systems lack in-system storage, we are developing a solid state storage simulator backed by DRAM. We are integrating these efforts around an I/O-intensive workload provided by the scalable checkpoint/restart (SCR) library. We expect our efforts to reduce the overheads of checkpointing and data movement across the system and thus to improve the scalability and reliability of HPC applications.
Evaluating operating system vulnerability to memory errors
Kurt B. Ferreira, K. Pedretti, R. Brightwell, P. Bridges, David Fiala, F. Mueller
International Workshop on Runtime and Operating Systems for Supercomputers, 2012. DOI: 10.1145/2318916.2318930

Abstract: Reliability is of great concern to the scalability of extreme-scale systems. Of particular concern are soft errors in main memory, which are a leading cause of failures on current systems and are predicted to be the leading cause on future systems. While great effort has gone into designing algorithms and applications that can continue to make progress in the presence of these errors without restarting, the most critical software running on a node, the operating system (OS), is currently left relatively unprotected. OS resiliency is of particular importance because, though this software typically occupies only a small fraction of a compute node's physical memory, recent studies show more memory errors in this region than in the remainder of the system. In this paper, we investigate the soft error vulnerability of two operating systems used in current and future high-performance computing systems: Kitten, the lightweight kernel developed at Sandia National Laboratories, and CLE, a high-performance Linux-based operating system developed by Cray. For each of these platforms, we outline the major structures and subsystems that are vulnerable to soft errors and describe methods that could be used to reconstruct damaged state. Our results show that the Kitten lightweight operating system may be an easier target to harden against memory errors due to its smaller memory footprint, largely deterministic state, and simpler system structure.
{"title":"I/O threads to reduce checkpoint blocking for an electromagnetics solver on Blue Gene/P and Cray XK6","authors":"Jing Fu, R. Latham, M. Min, C. Carothers","doi":"10.1145/2318916.2318919","DOIUrl":"https://doi.org/10.1145/2318916.2318919","url":null,"abstract":"Application-level checkpointing has been one of the most popular techniques to proactively deal with unexpected failures in supercomputers with hundreds of thousands of cores. Unfortunately, this approach results in heavy I/O load and often causes I/O bottlenecks in production runs. In this paper, we examine a new thread-based application-level checkpointing for a massively parallel electromagnetic solver system on the IBM Blue Gene/P at Argonne National Laboratory and the Cray XK6 at Oak Ridge National Laboratory. We discuss an I/O-thread based, application-level, two-phase I/O approach, called \"threaded reduced-blocking I/O\" (threaded rbIO), and compare it with a regular version of \"reduced-blocking I/O\" (rbIO) and a tuned MPI-IO collective approach (coIO). Our study shows that threaded rbIO can overlap the I/O latency with computation and achieve near-asynchronous checkpoint with an application-perceived I/O performance of over 70 GB/s (raw of 15 GB/s) and 50 GB/s (raw I/O bandwidth of 17 GB/s) on up to 32K processors of Intrepid and Jaguar, respectively. Compared with rbIO and coIO, the threading approach greatly improves the production performance of NekCEM on Blue Gene/P and Cray XK6 machines by significantly reducing the total simulation time from checkpoint blocking reduction. We also discuss the potential strength of this approach with the Scalable Checkpoint Restart library and on other full-featured operating systems such as that to be deployed on the upcoming Blue Gene/Q.","PeriodicalId":335825,"journal":{"name":"International Workshop on Runtime and Operating Systems for Supercomputers","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-06-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134485289","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Stepping towards noiseless Linux environment","authors":"Hakan Akkan, M. Lang, L. Liebrock","doi":"10.1145/2318916.2318925","DOIUrl":"https://doi.org/10.1145/2318916.2318925","url":null,"abstract":"Scientific applications are interrupted by the operating system far too often. Historically operating systems have been written efficiently to time-share a single resource, the CPU. We now have an abundance of cores but we are still swapping out the application to run other tasks and therefore increasing the application's time to solution. Current task scheduling in Linux is not tuned for a high performance computing environment, where a single job is running on all available cores. For example, checking for context switches hundreds of times per second is counter-productive in this setting.\u0000 One solution to this problem is to partition the cores between operating system and application; with the advent of many-core processors this approach is more attractive. This work describes our investigation of isolation of application processes from the operating system using a soft-partitioning scheme. We use increasingly invasive approaches; from configuration changes with available Linux features such as control groups and pinning interrupts using the CPU affinity settings, to invasive source level code changes to try to reduce, or in some cases completely eliminate, application interruptions such as OS clock ticks and timers.\u0000 Explained here are the measures that can be taken to reduce application interruption solely with compile and run time configurations in a recent unmodified Linux kernel. Although these measures have been available for a some time, to our knowledge, they have never been addressed in an HPC context. We then introduce our invasive method, where we remove the involuntary preemption induced by task scheduling. Our experiments show that parallel applications benefit from these modifications even at relatively small scales. At the modest scale of our testbed, we see a 1.72% improvement that should project into higher benefits at extreme scales.","PeriodicalId":335825,"journal":{"name":"International Workshop on Runtime and Operating Systems for Supercomputers","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-06-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114908754","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The RAMDISK storage accelerator: a method of accelerating I/O performance on HPC systems using RAMDISKs","authors":"Tim Wickberg, C. Carothers","doi":"10.1145/2318916.2318922","DOIUrl":"https://doi.org/10.1145/2318916.2318922","url":null,"abstract":"I/O performance in large-scale HPC systems has not kept pace with improvements in computational performance. This widening gap presents an opportunity to introduce a new layer into the HPC environment that specifically targets this divide. A RAMDISK Storage Accelerator (RSA) is proposed; a system leveraging the high-throughput and decreasing cost of DRAM to provide an application-transparent method for pre-staging input data and commit results back to a persistent disk storage system.\u0000 The RSA is constructed from a set of individual RSA nodes; each with large amounts of DRAM and a high-speed connection to the storage network. Memory from each node is made available through a dynamically constructed parallel filesystem to a compute job; data is asynchronously staged on to the RAMDISK ahead of compute job start and written back out to the persistent disk system after job completion. The RAMDISK provides very-high-speed, low-latency temporary storage that is dedicated to a specific job. Asynchronous data-staging frees the compute system from time that would otherwise be spent waiting for file I/O to finish at the start and end of execution. The RSA Scheduler is constructed to demonstrate this asynchronous data-staging model.","PeriodicalId":335825,"journal":{"name":"International Workshop on Runtime and Operating Systems for Supercomputers","volume":"58 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-06-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124953427","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Optimizing latency and throughput for spawning processes on massively multicore processors
Abhishek Kulkarni, A. Lumsdaine, M. Lang, Latchesar Ionkov
International Workshop on Runtime and Operating Systems for Supercomputers, 2012. DOI: 10.1145/2318916.2318924

Abstract: The execution of an SPMD application involves running multiple instances of a process with possibly varying arguments. With the widespread adoption of massively multicore processors, there has been a focus on harnessing the abundant compute resources effectively and in a power-efficient manner. Although much work has been done on optimizing distributed process launch using hierarchical techniques, there has been a void in studying the performance of spawning processes within a single node. Reducing the latency to spawn a new process locally results in faster global job launch. Further, emerging dynamic and resilient execution models are designed on the premise of maintaining process pools for fault isolation and launching several processes in a relatively short period of time. Optimizing the latency and throughput of process spawning would help improve the overall performance of runtime systems, allow adaptive process-replication reliability, and motivate the design and implementation of process management interfaces in future manycore operating systems.

In this paper, we study several factors that limit efficient spawning of processes on massively multicore architectures. We have developed a library that optimizes launching multiple instances of the same executable. Our microbenchmarks show a 20-80% decrease in process spawn time for multiple executables. We further discuss the effects of memory locality and propose NUMA-aware extensions to optimize launching processes with large memory-mapped segments, including dynamic shared libraries. Finally, we describe vector operating system interfaces for spawning a batch of processes from a given executable on specific cores. Our results show a 50x speedup over the traditional method of launching new processes using the fork and exec system calls.
{"title":"Better than native: using virtualization to improve compute node performance","authors":"Brian Kocoloski, J. Lange","doi":"10.1145/2318916.2318926","DOIUrl":"https://doi.org/10.1145/2318916.2318926","url":null,"abstract":"Modified variants of Linux are likely to be the underlying operating systems for future exascale platforms. Despite the many advantages of this approach, a subset of applications exist in which a lightweight kernel (LWK) based OS is needed and/or preferred. We contend that virtualization is capable of supporting LWKs as virtual machines (VMs) running at scale on top of a Linux environment. Furthermore, we claim that a properly designed virtual machine monitor (VMM) can provide an isolated and independent environment that avoids the overheads of the Linux host OS. To validate the feasibility of this approach we demonstrate that given a Linux host OS, benchmarks running in a virtualized LWK environment are capable of outperforming the same benchmarks executed directly on the Linux host.","PeriodicalId":335825,"journal":{"name":"International Workshop on Runtime and Operating Systems for Supercomputers","volume":"20 4","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-06-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114020812","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}