{"title":"GPUrdma: GPU-side library for high performance networking from GPU kernels","authors":"F. Daoud, Amir Wated, M. Silberstein","doi":"10.1145/2931088.2931091","DOIUrl":"https://doi.org/10.1145/2931088.2931091","url":null,"abstract":"We present GPUrdma, a GPU-side library for performing Remote Direct Memory Accesses (RDMA) across the network directly from GPU kernels. The library executes no code on CPU, directly accessing the Host Channel Adapter (HCA) Infiniband hardware for both control and data. Slow single-thread GPU performance and the intricacies of the GPU-to-network adapter interaction pose a significant challenge. We describe several design options and analyze their performance implications in detail. We achieve 5μsec one-way communication latency and up to 50Gbit/sec transfer bandwidth for messages from 16KB and larger between K40c NVIDIA GPUs across the network. Moreover, GPUrdma outperforms the CPU RDMA for smaller packets ranging from 2 to 1024 bytes by factor of 4.5x thanks to greater parallelism of transfer requests enabled by highly parallel GPU hardware. We use GPUrdma to implement a subset of the global address space programming interface (GPI) for point-to-point asynchronous RDMA messaging. We demonstrate our preliminary results using two simple applications -- ping-pong and a multi-matrix-vector product with constant matrix and multiple vectors -- each running on two different machines connected by Infiniband. Our basic ping-pong implementation achieves 5%higher performance than the baseline using GPI-2. The improved ping-pong implementation with per-threadblock communication overlap enables further 20% improvement. The multi-matrix-vector product is up to 4.5x faster thanks to higher throughput for small messages and the ability to keep the matrix in fast GPU shared memory while receiving new inputs. GPUrdma prototype is not yet suitable for production systems due to hardware constraints in the current generation of NVIDIA GPUs which we discuss in detail. However, our results highlight the great potential of GPU-side native networking, and encourage further research toward scalable, high-performance, heterogeneous networking infrastructure.","PeriodicalId":262414,"journal":{"name":"Proceedings of the 6th International Workshop on Runtime and Operating Systems for Supercomputers","volume":"19 35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130147686","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Cross-Enclave Composition Mechanism for Exascale System Software","authors":"N. Evans, K. Pedretti, Brian Kocoloski, J. Lange, M. Lang, P. Bridges","doi":"10.1145/2931088.2931094","DOIUrl":"https://doi.org/10.1145/2931088.2931094","url":null,"abstract":"As supercomputers move to exascale, the number of cores per node continues to increase, but the I/O bandwidth between nodes is increasing more slowly. This leads to computational power outstripping I/O bandwidth. This growth, in turn, encourages moving as much of an HPC workflow as possible onto the node in order to minimize data movement. One particular method of application composition, enclaves, co-locates different operating systems and runtimes on the same node where they communicate by in situ communication mechanisms. In this work, we describe a mechanism for communicating between composed applications. We implement a mechanism using Copy on Write cooperating with XEMEM shared memory to provide consistent, implicitly unsynchronized communication across enclaves. We then evaluate this mechanism using a composed application and analytics between the Kitten Lightweight Kernel and Linux on top of the Hobbes Operating System and Runtime. These results show a 3% overhead compared to an application running in isolation, demonstrating the viability of this approach.","PeriodicalId":262414,"journal":{"name":"Proceedings of the 6th International Workshop on Runtime and Operating Systems for Supercomputers","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133082216","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Multi-Kernel Survey for High-Performance Computing","authors":"Balazs Gerofi, Y. Ishikawa, R. Riesen, R. Wisniewski, Yoonho Park, Bryan S. Rosenburg","doi":"10.1145/2931088.2931092","DOIUrl":"https://doi.org/10.1145/2931088.2931092","url":null,"abstract":"In HPC, two trends have led to the emergence and popularity of an operating-system approach in which multiple kernels are run simultaneously on each compute node. The first trend has been the increase in complexity of the HPC software environment, which has placed the traditional HPC kernel approaches under stress. Meanwhile, microprocessors with more and more cores are being produced, allowing specialization within a node. As is typical in an emerging field, different groups are considering many different approaches to deploying multi-kernels. In this paper we identify and describe a number of ongoing HPC multi-kernel efforts. Given the increasing number of choices for implementing and providing compute node kernel functionality, users and system designers will find value in understanding the differences among the kernels (and among the perspectives) of the different multi-kernel efforts. To that end, we provide a survey of approaches and qualitatively compare and contrast the alternatives. We identify a series of criteria that characterize the salient differences among the approaches, providing users and system designers with a common language for discussing the features of a design that are relevant for them. In addition to the set of criteria for characterizing multi-kernel architectures, the paper contributes a classification of current multi-kernel projects according to those criteria.","PeriodicalId":262414,"journal":{"name":"Proceedings of the 6th International Workshop on Runtime and Operating Systems for Supercomputers","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126490726","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Decoupled: Low-Effort Noise-Free Execution on Commodity Systems","authors":"A. Lackorzynski, C. Weinhold, Hermann Härtig","doi":"10.1145/2931088.2931095","DOIUrl":"https://doi.org/10.1145/2931088.2931095","url":null,"abstract":"Today's high-performance computing (HPC) landscape is dominated by clusters built from commodity hardware. The nodes of these systems are essentially x86-based servers that run an operating system (OS) derived from an enterprise Linux distribution. In contrast, previous generations of supercomputers ran OSes that were designed specifically for the needs of HPC applications. The migration from these special-purpose OSes to off-the-shelf system software brought many advantages for both vendors and users, most importantly reduced costs and a larger feature set. However, it also left behind an important property: jitter-free execution of parallel programs. This jitter, often called OS noise, causes slowdowns for many important applications and is expected to become a major obstacle to exascale computing. Therefore, several OS research projects aim at building light-weight kernels that provide HPC applications with a noise-free execution environment. Linux runs next to these new kernels and provides functionality that they (intentionally) do not implement. However, building these new kernels and all the required support infrastructure requires considerable development and maintenance effort. We argue that a noise-free HPC OS can be built upon existing components with much less effort. In this paper, we describe a node OS that combines an off-the-shelf microkernel with a virtualized Linux kernel that provides rich functionality, including device drivers. We extended these two building blocks with a simple mechanism to decouple program execution from noisy Linux. We evaluate our prototype on a recently installed InfiniBand cluster.","PeriodicalId":262414,"journal":{"name":"Proceedings of the 6th International Workshop on Runtime and Operating Systems for Supercomputers","volume":"131 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115895267","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"HermitCore: A Unikernel for Extreme Scale Computing","authors":"Stefan Lankes, Simon Pickartz, Jens Breitbart","doi":"10.1145/2931088.2931093","DOIUrl":"https://doi.org/10.1145/2931088.2931093","url":null,"abstract":"We expect that the size and the complexity of future supercomputers will increase on their path to exascale systems and beyond. Therefore, system software has to adapt to the complexity of these systems for a simplification of the development of scalable applications. In this paper, we present a unikernel operating system design for HPC. It extends the multi-kernel approach while providing better programmability and scalability for hierarchical systems, such as HLRS' Hazel Hen, which base on multiple cluster-on-a-chip processors. We prove the scalability of the design via micro benchmarks by taking the example of HermitCore---our prototype implementation of the new design.","PeriodicalId":262414,"journal":{"name":"Proceedings of the 6th International Workshop on Runtime and Operating Systems for Supercomputers","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127690872","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Scalable Runtime for the ECOSCALE Heterogeneous Exascale Hardware Platform","authors":"P. Harvey, K. Bakanov, I. Spence, Dimitrios S. Nikolopoulos","doi":"10.1145/2931088.2931090","DOIUrl":"https://doi.org/10.1145/2931088.2931090","url":null,"abstract":"Exascale computation is the next target of high performance computing. In the push to create exascale computing platforms, simply increasing the number of hardware devices is not an acceptable option given the limitations of power consumption, heat dissipation, and programming models which are designed for current hardware platforms. Instead, new hardware technologies, coupled with improved programming abstractions and more autonomous runtime systems, are required to achieve this goal. This position paper presents the design of a new runtime for a new heterogeneous hardware platform being developed to explore energy efficient, high performance computing. By extending and enhancing the OpenCL framework, this work will both simplify the programming of current and future HPC applications, as well as automating the scheduling of data and computation across this new hardware platform. Also, this work explores the use of FPGAs to achieve both the power and performance goals of exascale, as well as utilising the runtime to automatically effect dynamic configuration and reconfiguration of hardware platforms.","PeriodicalId":262414,"journal":{"name":"Proceedings of the 6th International Workshop on Runtime and Operating Systems for Supercomputers","volume":"128 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123348255","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Quest for Unified, Global View Parallel Programming Models for Our Future","authors":"K. Taura","doi":"10.1145/2931088.2931089","DOIUrl":"https://doi.org/10.1145/2931088.2931089","url":null,"abstract":"Developing highly scalable programs on today's HPC machines is becoming ever more challenging, due to decreasing byte-flops ratio, deepening memory/network hierarchies, and heterogeneity. Programmers need to learn a distinct programming API for each layer of the hierarchy and overcome performance issues at all layers, one at a time, when the underlying high-level principle for performance is in fact fairly common across layers---locality. Future programming models must allow the programmer to express locality and parallelism in high level terms and their implementation should map exposed parallelism onto different layers of the machine (nodes, cores, and vector units) efficiently by concerted efforts of compilers and runtime systems. In this talk, I will argue that a global view task parallel programming model is a promising direction toward this goal that can reconcile generality, programmability, and performance at a high level. I will then talk about our ongoing research efforts with this prospect. They include: MassiveThreads, a lightweight user-level thread package for multicore systems; MassiveThreads/DM, its extension to distributed memory machines; DAGViz, a performance analyzer specifically designed for task parallel programs; and a task-vectorizing compiler that transforms task parallel programs into vectorized and parallelized instructions. I will end by sharing our prospects on how emerging hardware features and fruitful co-design efforts may help achieve the challenging goal.","PeriodicalId":262414,"journal":{"name":"Proceedings of the 6th International Workshop on Runtime and Operating Systems for Supercomputers","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129565736","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Proceedings of the 6th International Workshop on Runtime and Operating Systems for Supercomputers","authors":"","doi":"10.1145/2931088","DOIUrl":"https://doi.org/10.1145/2931088","url":null,"abstract":"","PeriodicalId":262414,"journal":{"name":"Proceedings of the 6th International Workshop on Runtime and Operating Systems for Supercomputers","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126410404","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}