A design of hybrid operating system for a parallel computer with multi-core and many-core processors
Mikiko Sato, Go Fukazawa, Kiyohiko Nagamine, Ryuichi Sakamoto, M. Namiki, Kazumi Yoshinaga, Y. Tsujita, A. Hori, Y. Ishikawa
International Workshop on Runtime and Operating Systems for Supercomputers, 2012. DOI: 10.1145/2318916.2318927

Abstract: This paper describes the design of an operating system for managing a hybrid computer architecture that combines multi-core and many-core processors for exascale computing. In this design, a host operating system (Host OS) on the multi-core processor performs some functions of a lightweight operating system (LWOS) on the many-core processor, so that the many-core processor can be dedicated to executing the application program. In particular, to ensure that LWOS execution does not disturb the application running on the many-core processor, functions such as process management, memory management, and I/O management are delegated to the Host OS. To demonstrate this design, we built a prototype of a computer equipped with a multi-core and a many-core processor using an Intel Xeon dual-core system. Linux and our original LWOS were loaded onto the respective processors, and the overhead of launching a program on the LWOS from Linux was evaluated. On this prototype, an LWOS process for the many-core program can be started with an overhead of at least 110 μs.
{"title":"Supercomputing operating systems: a naive view from over the fence","authors":"Timothy Roscoe","doi":"10.1145/2318916.2318917","DOIUrl":"https://doi.org/10.1145/2318916.2318917","url":null,"abstract":"To exaggerate unfairly, from the perspective of mainstream OS research, the supercomputing community has a very different idea of the role (and appropriate design) of an OS. HPC people regard the OS as an annoying source of noise, whereas the former crowd see it as a thing of wondrous beauty and elegance, a sine qua non of usable everyday computing.\u0000 This situation has existed without serious conflict erupting for years: OS researchers worried about PCs with one core (or, at most, a handful of cores) running a general-purpose OS and supporting a dynamic, bursty, diverse mix of hundreds of interactive, long-running, soft-realtime and/or background processes. Supercomputing people wanted one, highly parallel, program to finish as soon as possible so they could get on to the next one.\u0000 With multicore, this all changed: highly parallel tasks will be the norm for future general-purpose computing. In 2007, my colleagues and I eagerly embarked on a new research OS for multicore computing, and looked forward to applying long-ignored (in our field) results from the HPC realm to our system.\u0000 It didn't quite work out that way. In this talk I will look at what we found to be common to the two fields, and what we didn't, and speculate on where this might be going. I think there is a useful conversation to be had, and I'd like to help revive it.","PeriodicalId":335825,"journal":{"name":"International Workshop on Runtime and Operating Systems for Supercomputers","volume":"61 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-06-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114644143","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Node-based memory management for scalable NUMA architectures","authors":"Stefan Lankes, T. Bemmerl, T. Roehl, C. Terboven","doi":"10.1145/2318916.2318929","DOIUrl":"https://doi.org/10.1145/2318916.2318929","url":null,"abstract":"Large state-of-the-art NUMA systems may offer more than two levels of node distances. The result is a hierarchical architecture with significant differences in memory access bandwidth and latency. Consequently, NUMA-aware memory management and the reduction of remote memory accesses becomes more and more the key challenge for the operating system and its applications. In this paper, we will show that traditional, centralized concepts to realize paging are not longer an adequate approach for these architectures. We present a prototype of new node-based memory management for the Linux kernel and prove its scalability and usability. The hardware architecture is reflected by managing one page mapping table per NUMA node and the kernel's page fault handler is extended to create node-local references. Based on this prototype, we suggest extensions to simplify the detection of performance issues, which will increase the usability of such architectures.","PeriodicalId":335825,"journal":{"name":"International Workshop on Runtime and Operating Systems for Supercomputers","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-06-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116596423","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Integrated in-system storage architecture for high performance computing
D. Kimpe, K. Mohror, A. Moody, B. V. Essen, M. Gokhale, R. Ross, B. Supinski
International Workshop on Runtime and Operating Systems for Supercomputers, 2012. DOI: 10.1145/2318916.2318921

Abstract: In-system solid state storage is expected to be an important component of the I/O subsystem on the first exascale platforms, as it has the potential to reduce DRAM requirements, to increase system reliability, and to smooth I/O loads.

This paper describes the design of a prototype integrated in-system storage architecture that we are developing to serve the diverse needs of high performance computing. Our container abstraction will provide lightweight management of in-system storage devices, as well as methods to access containers remotely and to transfer them within the storage hierarchy. We are also working on a storage hierarchy abstraction API to provide portable HPC I/O software with critical information on the configuration of the system on which it is running. As currently available large-scale HPC systems lack in-system storage, we are developing a solid state storage simulator backed by DRAM. We are integrating these efforts around an I/O-intensive workload provided by the scalable checkpoint/restart (SCR) library. We expect our efforts to reduce the overheads of checkpointing and data movement across the system and thus to improve the scalability and reliability of HPC applications.
Evaluating operating system vulnerability to memory errors
Kurt B. Ferreira, K. Pedretti, R. Brightwell, P. Bridges, David Fiala, F. Mueller
International Workshop on Runtime and Operating Systems for Supercomputers, 2012. DOI: 10.1145/2318916.2318930

Abstract: Reliability is of great concern to the scalability of extreme-scale systems. Of particular concern are soft errors in main memory, which are a leading cause of failures on current systems and are predicted to be the leading cause on future systems. While great effort has gone into designing algorithms and applications that can continue to make progress in the presence of these errors without restarting, the most critical software running on a node, the operating system (OS), is currently left relatively unprotected. OS resiliency is of particular importance because, though this software typically occupies only a small fraction of a compute node's physical memory, recent studies show more memory errors in this region than in the remainder of the system. In this paper, we investigate the soft error vulnerability of two operating systems used in current and future high-performance computing systems: Kitten, the lightweight kernel developed at Sandia National Laboratories, and CLE, a high-performance Linux-based operating system developed by Cray. For each of these platforms, we outline the major structures and subsystems that are vulnerable to soft errors and describe methods that could be used to reconstruct damaged state. Our results show that the Kitten lightweight operating system may be an easier target to harden against memory errors due to its smaller memory footprint, largely deterministic state, and simpler system structure.
{"title":"I/O threads to reduce checkpoint blocking for an electromagnetics solver on Blue Gene/P and Cray XK6","authors":"Jing Fu, R. Latham, M. Min, C. Carothers","doi":"10.1145/2318916.2318919","DOIUrl":"https://doi.org/10.1145/2318916.2318919","url":null,"abstract":"Application-level checkpointing has been one of the most popular techniques to proactively deal with unexpected failures in supercomputers with hundreds of thousands of cores. Unfortunately, this approach results in heavy I/O load and often causes I/O bottlenecks in production runs. In this paper, we examine a new thread-based application-level checkpointing for a massively parallel electromagnetic solver system on the IBM Blue Gene/P at Argonne National Laboratory and the Cray XK6 at Oak Ridge National Laboratory. We discuss an I/O-thread based, application-level, two-phase I/O approach, called \"threaded reduced-blocking I/O\" (threaded rbIO), and compare it with a regular version of \"reduced-blocking I/O\" (rbIO) and a tuned MPI-IO collective approach (coIO). Our study shows that threaded rbIO can overlap the I/O latency with computation and achieve near-asynchronous checkpoint with an application-perceived I/O performance of over 70 GB/s (raw of 15 GB/s) and 50 GB/s (raw I/O bandwidth of 17 GB/s) on up to 32K processors of Intrepid and Jaguar, respectively. Compared with rbIO and coIO, the threading approach greatly improves the production performance of NekCEM on Blue Gene/P and Cray XK6 machines by significantly reducing the total simulation time from checkpoint blocking reduction. We also discuss the potential strength of this approach with the Scalable Checkpoint Restart library and on other full-featured operating systems such as that to be deployed on the upcoming Blue Gene/Q.","PeriodicalId":335825,"journal":{"name":"International Workshop on Runtime and Operating Systems for Supercomputers","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-06-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134485289","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Stepping towards noiseless Linux environment","authors":"Hakan Akkan, M. Lang, L. Liebrock","doi":"10.1145/2318916.2318925","DOIUrl":"https://doi.org/10.1145/2318916.2318925","url":null,"abstract":"Scientific applications are interrupted by the operating system far too often. Historically operating systems have been written efficiently to time-share a single resource, the CPU. We now have an abundance of cores but we are still swapping out the application to run other tasks and therefore increasing the application's time to solution. Current task scheduling in Linux is not tuned for a high performance computing environment, where a single job is running on all available cores. For example, checking for context switches hundreds of times per second is counter-productive in this setting.\u0000 One solution to this problem is to partition the cores between operating system and application; with the advent of many-core processors this approach is more attractive. This work describes our investigation of isolation of application processes from the operating system using a soft-partitioning scheme. We use increasingly invasive approaches; from configuration changes with available Linux features such as control groups and pinning interrupts using the CPU affinity settings, to invasive source level code changes to try to reduce, or in some cases completely eliminate, application interruptions such as OS clock ticks and timers.\u0000 Explained here are the measures that can be taken to reduce application interruption solely with compile and run time configurations in a recent unmodified Linux kernel. Although these measures have been available for a some time, to our knowledge, they have never been addressed in an HPC context. We then introduce our invasive method, where we remove the involuntary preemption induced by task scheduling. Our experiments show that parallel applications benefit from these modifications even at relatively small scales. At the modest scale of our testbed, we see a 1.72% improvement that should project into higher benefits at extreme scales.","PeriodicalId":335825,"journal":{"name":"International Workshop on Runtime and Operating Systems for Supercomputers","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-06-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114908754","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The RAMDISK storage accelerator: a method of accelerating I/O performance on HPC systems using RAMDISKs","authors":"Tim Wickberg, C. Carothers","doi":"10.1145/2318916.2318922","DOIUrl":"https://doi.org/10.1145/2318916.2318922","url":null,"abstract":"I/O performance in large-scale HPC systems has not kept pace with improvements in computational performance. This widening gap presents an opportunity to introduce a new layer into the HPC environment that specifically targets this divide. A RAMDISK Storage Accelerator (RSA) is proposed; a system leveraging the high-throughput and decreasing cost of DRAM to provide an application-transparent method for pre-staging input data and commit results back to a persistent disk storage system.\u0000 The RSA is constructed from a set of individual RSA nodes; each with large amounts of DRAM and a high-speed connection to the storage network. Memory from each node is made available through a dynamically constructed parallel filesystem to a compute job; data is asynchronously staged on to the RAMDISK ahead of compute job start and written back out to the persistent disk system after job completion. The RAMDISK provides very-high-speed, low-latency temporary storage that is dedicated to a specific job. Asynchronous data-staging frees the compute system from time that would otherwise be spent waiting for file I/O to finish at the start and end of execution. The RSA Scheduler is constructed to demonstrate this asynchronous data-staging model.","PeriodicalId":335825,"journal":{"name":"International Workshop on Runtime and Operating Systems for Supercomputers","volume":"58 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-06-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124953427","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Optimizing latency and throughput for spawning processes on massively multicore processors
Abhishek Kulkarni, A. Lumsdaine, M. Lang, Latchesar Ionkov
International Workshop on Runtime and Operating Systems for Supercomputers, 2012. DOI: 10.1145/2318916.2318924

Abstract: The execution of an SPMD application involves running multiple instances of a process with possibly varying arguments. With the widespread adoption of massively multicore processors, there has been a focus on harnessing the abundant compute resources effectively and in a power-efficient manner. Although much work has been done on optimizing distributed process launch using hierarchical techniques, there has been a void in studying the performance of spawning processes within a single node. Reducing the latency to spawn a new process locally results in faster global job launch. Further, emerging dynamic and resilient execution models are designed on the premise of maintaining process pools for fault isolation and launching several processes in a relatively short period of time. Optimizing the latency and throughput of process spawning would help improve the overall performance of runtime systems, allow adaptive process-replication reliability, and motivate the design and implementation of process management interfaces in future manycore operating systems.

In this paper, we study several factors that limit efficient spawning of processes on massively multicore architectures. We have developed a library that optimizes launching multiple instances of the same executable. Our microbenchmarks show a 20-80% decrease in process spawn time for multiple executables. We further discuss the effects of memory locality and propose NUMA-aware extensions to optimize launching processes with large memory-mapped segments, including dynamic shared libraries. Finally, we describe vector operating system interfaces for spawning a batch of processes from a given executable on specific cores. Our results show a 50x speedup over the traditional method of launching new processes using the fork and exec system calls.
{"title":"Better than native: using virtualization to improve compute node performance","authors":"Brian Kocoloski, J. Lange","doi":"10.1145/2318916.2318926","DOIUrl":"https://doi.org/10.1145/2318916.2318926","url":null,"abstract":"Modified variants of Linux are likely to be the underlying operating systems for future exascale platforms. Despite the many advantages of this approach, a subset of applications exist in which a lightweight kernel (LWK) based OS is needed and/or preferred. We contend that virtualization is capable of supporting LWKs as virtual machines (VMs) running at scale on top of a Linux environment. Furthermore, we claim that a properly designed virtual machine monitor (VMM) can provide an isolated and independent environment that avoids the overheads of the Linux host OS. To validate the feasibility of this approach we demonstrate that given a Linux host OS, benchmarks running in a virtualized LWK environment are capable of outperforming the same benchmarks executed directly on the Linux host.","PeriodicalId":335825,"journal":{"name":"International Workshop on Runtime and Operating Systems for Supercomputers","volume":"20 4","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-06-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114020812","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}