{"title":"GPUrdma: GPU-side library for high performance networking from GPU kernels","authors":"F. Daoud, Amir Wated, M. Silberstein","doi":"10.1145/2931088.2931091","DOIUrl":"https://doi.org/10.1145/2931088.2931091","url":null,"abstract":"We present GPUrdma, a GPU-side library for performing Remote Direct Memory Accesses (RDMA) across the network directly from GPU kernels. The library executes no code on CPU, directly accessing the Host Channel Adapter (HCA) Infiniband hardware for both control and data. Slow single-thread GPU performance and the intricacies of the GPU-to-network adapter interaction pose a significant challenge. We describe several design options and analyze their performance implications in detail. We achieve 5μsec one-way communication latency and up to 50Gbit/sec transfer bandwidth for messages from 16KB and larger between K40c NVIDIA GPUs across the network. Moreover, GPUrdma outperforms the CPU RDMA for smaller packets ranging from 2 to 1024 bytes by factor of 4.5x thanks to greater parallelism of transfer requests enabled by highly parallel GPU hardware. We use GPUrdma to implement a subset of the global address space programming interface (GPI) for point-to-point asynchronous RDMA messaging. We demonstrate our preliminary results using two simple applications -- ping-pong and a multi-matrix-vector product with constant matrix and multiple vectors -- each running on two different machines connected by Infiniband. Our basic ping-pong implementation achieves 5%higher performance than the baseline using GPI-2. The improved ping-pong implementation with per-threadblock communication overlap enables further 20% improvement. The multi-matrix-vector product is up to 4.5x faster thanks to higher throughput for small messages and the ability to keep the matrix in fast GPU shared memory while receiving new inputs. GPUrdma prototype is not yet suitable for production systems due to hardware constraints in the current generation of NVIDIA GPUs which we discuss in detail. However, our results highlight the great potential of GPU-side native networking, and encourage further research toward scalable, high-performance, heterogeneous networking infrastructure.","PeriodicalId":262414,"journal":{"name":"Proceedings of the 6th International Workshop on Runtime and Operating Systems for Supercomputers","volume":"19 35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130147686","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Cross-Enclave Composition Mechanism for Exascale System Software","authors":"N. Evans, K. Pedretti, Brian Kocoloski, J. Lange, M. Lang, P. Bridges","doi":"10.1145/2931088.2931094","DOIUrl":"https://doi.org/10.1145/2931088.2931094","url":null,"abstract":"As supercomputers move to exascale, the number of cores per node continues to increase, but the I/O bandwidth between nodes is increasing more slowly. This leads to computational power outstripping I/O bandwidth. This growth, in turn, encourages moving as much of an HPC workflow as possible onto the node in order to minimize data movement. One particular method of application composition, enclaves, co-locates different operating systems and runtimes on the same node where they communicate by in situ communication mechanisms. In this work, we describe a mechanism for communicating between composed applications. We implement a mechanism using Copy on Write cooperating with XEMEM shared memory to provide consistent, implicitly unsynchronized communication across enclaves. We then evaluate this mechanism using a composed application and analytics between the Kitten Lightweight Kernel and Linux on top of the Hobbes Operating System and Runtime. These results show a 3% overhead compared to an application running in isolation, demonstrating the viability of this approach.","PeriodicalId":262414,"journal":{"name":"Proceedings of the 6th International Workshop on Runtime and Operating Systems for Supercomputers","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133082216","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Multi-Kernel Survey for High-Performance Computing","authors":"Balazs Gerofi, Y. Ishikawa, R. Riesen, R. Wisniewski, Yoonho Park, Bryan S. Rosenburg","doi":"10.1145/2931088.2931092","DOIUrl":"https://doi.org/10.1145/2931088.2931092","url":null,"abstract":"In HPC, two trends have led to the emergence and popularity of an operating-system approach in which multiple kernels are run simultaneously on each compute node. The first trend has been the increase in complexity of the HPC software environment, which has placed the traditional HPC kernel approaches under stress. Meanwhile, microprocessors with more and more cores are being produced, allowing specialization within a node. As is typical in an emerging field, different groups are considering many different approaches to deploying multi-kernels. In this paper we identify and describe a number of ongoing HPC multi-kernel efforts. Given the increasing number of choices for implementing and providing compute node kernel functionality, users and system designers will find value in understanding the differences among the kernels (and among the perspectives) of the different multi-kernel efforts. To that end, we provide a survey of approaches and qualitatively compare and contrast the alternatives. We identify a series of criteria that characterize the salient differences among the approaches, providing users and system designers with a common language for discussing the features of a design that are relevant for them. In addition to the set of criteria for characterizing multi-kernel architectures, the paper contributes a classification of current multi-kernel projects according to those criteria.","PeriodicalId":262414,"journal":{"name":"Proceedings of the 6th International Workshop on Runtime and Operating Systems for Supercomputers","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126490726","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Decoupled: Low-Effort Noise-Free Execution on Commodity Systems","authors":"A. Lackorzynski, C. Weinhold, Hermann Härtig","doi":"10.1145/2931088.2931095","DOIUrl":"https://doi.org/10.1145/2931088.2931095","url":null,"abstract":"Today's high-performance computing (HPC) landscape is dominated by clusters built from commodity hardware. The nodes of these systems are essentially x86-based servers that run an operating system (OS) derived from an enterprise Linux distribution. In contrast, previous generations of supercomputers ran OSes that were designed specifically for the needs of HPC applications. The migration from these special-purpose OSes to off-the-shelf system software brought many advantages for both vendors and users, most importantly reduced costs and a larger feature set. However, it also left behind an important property: jitter-free execution of parallel programs. This jitter, often called OS noise, causes slowdowns for many important applications and is expected to become a major obstacle to exascale computing. Therefore, several OS research projects aim at building light-weight kernels that provide HPC applications with a noise-free execution environment. Linux runs next to these new kernels and provides functionality that they (intentionally) do not implement. However, building these new kernels and all the required support infrastructure requires considerable development and maintenance effort. We argue that a noise-free HPC OS can be built upon existing components with much less effort. In this paper, we describe a node OS that combines an off-the-shelf microkernel with a virtualized Linux kernel that provides rich functionality, including device drivers. We extended these two building blocks with a simple mechanism to decouple program execution from noisy Linux. We evaluate our prototype on a recently installed InfiniBand cluster.","PeriodicalId":262414,"journal":{"name":"Proceedings of the 6th International Workshop on Runtime and Operating Systems for Supercomputers","volume":"131 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115895267","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"HermitCore: A Unikernel for Extreme Scale Computing","authors":"Stefan Lankes, Simon Pickartz, Jens Breitbart","doi":"10.1145/2931088.2931093","DOIUrl":"https://doi.org/10.1145/2931088.2931093","url":null,"abstract":"We expect that the size and the complexity of future supercomputers will increase on their path to exascale systems and beyond. Therefore, system software has to adapt to the complexity of these systems for a simplification of the development of scalable applications. In this paper, we present a unikernel operating system design for HPC. It extends the multi-kernel approach while providing better programmability and scalability for hierarchical systems, such as HLRS' Hazel Hen, which base on multiple cluster-on-a-chip processors. We prove the scalability of the design via micro benchmarks by taking the example of HermitCore---our prototype implementation of the new design.","PeriodicalId":262414,"journal":{"name":"Proceedings of the 6th International Workshop on Runtime and Operating Systems for Supercomputers","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127690872","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Scalable Runtime for the ECOSCALE Heterogeneous Exascale Hardware Platform","authors":"P. Harvey, K. Bakanov, I. Spence, Dimitrios S. Nikolopoulos","doi":"10.1145/2931088.2931090","DOIUrl":"https://doi.org/10.1145/2931088.2931090","url":null,"abstract":"Exascale computation is the next target of high performance computing. In the push to create exascale computing platforms, simply increasing the number of hardware devices is not an acceptable option given the limitations of power consumption, heat dissipation, and programming models which are designed for current hardware platforms. Instead, new hardware technologies, coupled with improved programming abstractions and more autonomous runtime systems, are required to achieve this goal. This position paper presents the design of a new runtime for a new heterogeneous hardware platform being developed to explore energy efficient, high performance computing. By extending and enhancing the OpenCL framework, this work will both simplify the programming of current and future HPC applications, as well as automating the scheduling of data and computation across this new hardware platform. Also, this work explores the use of FPGAs to achieve both the power and performance goals of exascale, as well as utilising the runtime to automatically effect dynamic configuration and reconfiguration of hardware platforms.","PeriodicalId":262414,"journal":{"name":"Proceedings of the 6th International Workshop on Runtime and Operating Systems for Supercomputers","volume":"128 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123348255","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Quest for Unified, Global View Parallel Programming Models for Our Future","authors":"K. Taura","doi":"10.1145/2931088.2931089","DOIUrl":"https://doi.org/10.1145/2931088.2931089","url":null,"abstract":"Developing highly scalable programs on today's HPC machines is becoming ever more challenging, due to decreasing byte-flops ratio, deepening memory/network hierarchies, and heterogeneity. Programmers need to learn a distinct programming API for each layer of the hierarchy and overcome performance issues at all layers, one at a time, when the underlying high-level principle for performance is in fact fairly common across layers---locality. Future programming models must allow the programmer to express locality and parallelism in high level terms and their implementation should map exposed parallelism onto different layers of the machine (nodes, cores, and vector units) efficiently by concerted efforts of compilers and runtime systems. In this talk, I will argue that a global view task parallel programming model is a promising direction toward this goal that can reconcile generality, programmability, and performance at a high level. I will then talk about our ongoing research efforts with this prospect. They include: MassiveThreads, a lightweight user-level thread package for multicore systems; MassiveThreads/DM, its extension to distributed memory machines; DAGViz, a performance analyzer specifically designed for task parallel programs; and a task-vectorizing compiler that transforms task parallel programs into vectorized and parallelized instructions. I will end by sharing our prospects on how emerging hardware features and fruitful co-design efforts may help achieve the challenging goal.","PeriodicalId":262414,"journal":{"name":"Proceedings of the 6th International Workshop on Runtime and Operating Systems for Supercomputers","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129565736","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Proceedings of the 6th International Workshop on Runtime and Operating Systems for Supercomputers","authors":"","doi":"10.1145/2931088","DOIUrl":"https://doi.org/10.1145/2931088","url":null,"abstract":"","PeriodicalId":262414,"journal":{"name":"Proceedings of the 6th International Workshop on Runtime and Operating Systems for Supercomputers","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126410404","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}