{"title":"Designing OS for HPC Applications: Scheduling","authors":"R. Gioiosa, S. Mckee, M. Valero","doi":"10.1109/CLUSTER.2010.16","DOIUrl":"https://doi.org/10.1109/CLUSTER.2010.16","url":null,"abstract":"Operating systems have historically been implemented as independent layers between hardware and applications. User programs communicate with the OS through a set of well defined system calls, and do not have direct access to the hardware. The OS, in turn, communicates with the underlying architecture via control registers. Except for these interfaces, the three layers are practically oblivious to each other. While this structure improves portability and transparency, it may not deliver optimal performance. This is especially true for High Performance Computing (HPC) systems, where modern parallel applications and multi-core architectures pose new challenges in terms of performance, power consumption, and system utilization. The hardware, the OS, and the applications can no longer remain isolated, and instead should cooperate to deliver high performance with minimal power consumption. In this paper we present our experience with the design and implementation of High Performance Linux (HPL), an operating system designed to optimize the performance of HPC applications running on a state-of-the-art compute cluster. We show how characterizing parallel applications through hardware and software performance counters drives the design of the OS and how including knowledge about the architecture improves performance and efficiency. We perform experiments on a dual-socket IBM POWER6 machine, showing performance improvements and stability (performance variation of 2.11% on average) for NAS, a widely used parallel benchmark suite.","PeriodicalId":152171,"journal":{"name":"2010 IEEE International Conference on Cluster Computing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130548494","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: Acceleration of Streamed Tensor Contraction Expressions on GPGPU-Based Clusters
Authors: Wenjing Ma, S. Krishnamoorthy, Oreste Villa, K. Kowalski
Venue: 2010 IEEE International Conference on Cluster Computing, 2010-09-20
DOI: https://doi.org/10.1109/CLUSTER.2010.26
Abstract: Tensor contractions are generalized multidimensional matrix multiplication operations that occur widely in quantum chemistry. Efficient execution of tensor contractions on GPUs requires tackling several challenges, including index permutation and small dimension sizes that reduce thread-block utilization. In this paper, we present our approach to automatically generating CUDA code to execute tensor contractions on GPUs, including management of data movement between CPU and GPU. GPU-enabled code is generated for the most expensive contractions in CCSD(T), a key coupled-cluster method, and incorporated into NWChem, a popular computational chemistry suite. We demonstrate a speedup of over 8.4x when using one core per node, and of over 2.6x when utilizing the entire system, with a hybrid CPU+GPU solution using 2 GPUs and 5 cores. Finally, we analyze the behavior of the implementation on future GPU systems.
Title: How to Scale Nested OpenMP Applications on the ScaleMP vSMP Architecture
Authors: Dirk Schmidl, C. Terboven, A. Wolf, Dieter an Mey, C. Bischof
Venue: 2010 IEEE International Conference on Cluster Computing, 2010-09-20
DOI: https://doi.org/10.1109/CLUSTER.2010.38
Abstract: The novel ScaleMP vSMP architecture employs commodity x86-based servers with an InfiniBand network to assemble a large shared-memory system at an attractive price point. We examine this combined hardware and software approach to a DSM system using both system-level kernel benchmarks and real-world application codes. We compare this architecture with traditional shared-memory machines and elaborate on strategies to tune application codes parallelized with OpenMP on multiple levels. Finally, we summarize the conditions a scalable application has to fulfill in order to profit from the full potential of the ScaleMP approach.
Title: RDMA-Based Job Migration Framework for MPI over InfiniBand
Authors: Xiangyong Ouyang, Sonya Marcarelli, R. Rajachandrasekar, D. Panda
Venue: 2010 IEEE International Conference on Cluster Computing, 2010-09-20
DOI: https://doi.org/10.1109/CLUSTER.2010.20
Abstract: Coordinated checkpoint and recovery is a common approach to achieving fault tolerance on large-scale systems. The traditional mechanism dumps the process images of all the processes involved in the parallel job to a local disk or a central storage area. When a failure occurs, the processes are restarted and restored from the latest checkpoint image. However, this approach cannot provide the scalability required by increasingly large jobs, since it puts a heavy I/O burden on the storage subsystem, and resubmitting a job during the restart phase incurs a lengthy queuing delay. In this paper, we enhance the fault tolerance of MVAPICH2, an open-source high-performance MPI-2 implementation, with a proactive job migration scheme. Instead of checkpointing all the processes of the job and saving their process images to stable storage, we transfer the processes running on a health-deteriorating node to a healthy spare node and resume them there. RDMA-based process image transmission is designed to take advantage of InfiniBand's high-performance communication. Experimental results show that the job migration scheme achieves a 4.49x speedup over the checkpoint/restart scheme in handling a node failure for a 64-process application running on 8 compute nodes. To the best of our knowledge, this is the first such job migration design for InfiniBand-based clusters.
{"title":"TCCluster: A Cluster Architecture Utilizing the Processor Host Interface as a Network Interconnect","authors":"Heiner Litz, M. Thürmer, U. Brüning","doi":"10.1109/CLUSTER.2010.37","DOIUrl":"https://doi.org/10.1109/CLUSTER.2010.37","url":null,"abstract":"So far, large computing clusters consisting of several thousand machines have been constructed by connecting nodes together using interconnect technologies as e.g. Ethernet, Infiniband or Myrinet. We propose an entirely new architecture called Tightly Coupled Cluster (TCCluster) that instead uses the native host interface of the processors as a direct network interconnect. This approach offers higher bandwidth and much lower communication latencies than the traditional approaches by virtually integrating the network interface adapter into the processor. Our technique neither applies any modifications to the processor nor requires any additional hardware. Instead, we use commodity off the shelf AMD processors and exploit the HyperTransport host interface as a cluster interconnect. Our approach is purely software based and does not require any additional hardware nor modifications to the existing processors. In this paper, we explain the addressing of nodes in such a cluster, the routing within such a system and the programming model that can be applied. We present a detailed description of the tasks that need to be addressed and provide a proof of concept implementation. For the evaluation of our technique a two node TCCluster prototype is presented. Therefore, the BIOS firmware, a custom Linux kernel and a small message library has been developed. We present microbenchmarks that show a sustained bandwidth of up to 2500 MB/s for messages as small as 64 Byte and a communication latency of 227 ns between two nodes outperforming other high performance networks by an order of magnitude.","PeriodicalId":152171,"journal":{"name":"2010 IEEE International Conference on Cluster Computing","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116855301","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: Adaptive Optimization for Petascale Heterogeneous CPU/GPU Computing
Authors: Canqun Yang, Feng Wang, Yunfei Du, Juan Chen, Jie Liu, Huizhan Yi, Kai Lu
Venue: 2010 IEEE International Conference on Cluster Computing, 2010-09-20
DOI: https://doi.org/10.1109/CLUSTER.2010.12
Abstract: In this paper, we describe our experience developing an implementation of the Linpack benchmark for TianHe-1, a petascale CPU/GPU supercomputer and the largest GPU-accelerated system attempted to date. An adaptive optimization framework is presented that balances the workload distribution across the GPUs and CPUs with negligible runtime overhead, resulting in better performance than static or training-based partitioning methods. The CPU-GPU communication overhead is effectively hidden by a software pipelining technique, which is particularly useful for large memory-bound applications. Combined with other traditional optimizations, the Linpack implementation optimized with the adaptive framework achieved 196.7 GFLOPS on a single compute element of TianHe-1. This result is 70.1% of the peak compute capability and 3.3 times faster than the result obtained with the vendor's library. On the full configuration of TianHe-1, our optimizations resulted in a Linpack performance of 0.563 PFLOPS, which made TianHe-1 the 5th fastest supercomputer on the Top500 list released in November 2009.
{"title":"An Efficient Process Live Migration Mechanism for Load Balanced Distributed Virtual Environments","authors":"Balazs Gerofi, H. Fujita, Y. Ishikawa","doi":"10.1109/CLUSTER.2010.25","DOIUrl":"https://doi.org/10.1109/CLUSTER.2010.25","url":null,"abstract":"Distributed virtual environments (DVE), such as multi-player online games and distributed simulations may involve a massive amount of concurrent clients. Deploying distributed server architectures is currently the most prevalent way of providing such large-scale services, where typically the virtual space is divided into several distinct regions requiring each server to handle only part of the virtual world. Inequalities in client distribution may, however, cause certain servers to become overloaded, which potentially degrades the interactivity of the environment and thus renders the load balancing problem a crucial issue. Prior research has shown several approaches for avoiding uneven workload, nevertheless, addressing the problem mainly at the application layer. In this paper we focus on solving the DVE load balancing problem at the operating system level. We propose an efficient process live migration mechanism, which is optimized for processes maintaining a massive amount of network connections. Building on top of it, we have implemented a decentralized middleware that instruments process migration among the cluster nodes, attempting to equalize loads on all machines. We demonstrate the performance of the live migration mechanism on a real-world multiplayer game server and show the behavior of the load balancing engine through a realistic DVE simulation.","PeriodicalId":152171,"journal":{"name":"2010 IEEE International Conference on Cluster Computing","volume":"140 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132017540","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Exploiting Data Deduplication to Accelerate Live Virtual Machine Migration","authors":"Xiang Zhang, Zhigang Huo, Jie Ma, Dan Meng","doi":"10.1109/CLUSTER.2010.17","DOIUrl":"https://doi.org/10.1109/CLUSTER.2010.17","url":null,"abstract":"As one of the key characteristics of virtualization, live virtual machine (VM) migration provides great benefits for load balancing, power management, fault tolerance and other system maintenance issues in modern clusters and data centers. Although Pre-Copy is a widespread used migration algorithm, it does transfer a lot of duplicated memory image data from source to destination, which results in longer migration time and downtime. This paper proposes a novel VM migration approach, named Migration with Data Deduplication (MDD), which introduces data deduplication into migration. MDD utilizes the self-similarity of run-time memory image, uses hash based fingerprints to find identical and similar memory pages, and employs Run Length Encode (RLE) to eliminate redundant memory data during migration. Experiment demonstrates that compared with Xen's default Pre-Copy migration algorithm, MDD can reduce 56.60% of total data transferred during migration, 34.93% of total migration time, and 26.16% of downtime on average.","PeriodicalId":152171,"journal":{"name":"2010 IEEE International Conference on Cluster Computing","volume":"63 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132882065","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: Host Side Dynamic Reconfiguration with InfiniBand
Authors: Wei Lin Guay, Sven-Arne Reinemo, Olav Lysne, T. Skeie, Bjørn Dag Johnsen, Line Holen
Venue: 2010 IEEE International Conference on Cluster Computing, 2010-09-20
DOI: https://doi.org/10.1109/CLUSTER.2010.21
Abstract: Rerouting around faulty components and migrating jobs both require reconfiguration of data structures in the Queue Pairs residing in the hosts of an InfiniBand cluster. In this paper, we report an implementation of dynamic reconfiguration of such host-side data structures. Our implementation preserves the Queue Pairs and lets the application run without interruption. With this implementation, we demonstrate a complete solution to fault tolerance in an InfiniBand network, where dynamic network reconfiguration to a topology-agnostic routing function is used to avoid malfunctioning components. This solution is in principle able to let applications run uninterrupted on the cluster as long as the topology remains physically connected. Through measurements on our test cluster, we show that the setup-latency cost of our method is negligible and that there is only a minor reduction in throughput during reconfiguration.
{"title":"A Simulation Framework to Automatically Analyze the Communication-Computation Overlap in Scientific Applications","authors":"V. Subotic, J. Sancho, J. Labarta, M. Valero","doi":"10.1109/CLUSTER.2010.33","DOIUrl":"https://doi.org/10.1109/CLUSTER.2010.33","url":null,"abstract":"Overlapping communication and computation has been devised as an attractive technique to alleviate the huge application's network requirements at large scale. Overlapping will allow to fully or partially hide the long communication delays suffered when transferring messages through the network. This will relax the application's network requirements, and hence allow to deploy more cost-effective network designs. However, today's scientific applications make little use of overlapping. In addition, there is no support to analyze how overlap could impact the performance of real scientific applications. In this paper we address this issue by presenting a simulation framework to automatically analyze the benefits of communication-computation overlap. The simulation framework consists of a binary translation tool (Valgrind), a distributed machine simulator (Dimemas), and a visualization tool (Paraver). Valgrind instruments the legacy MPI application and generates the execution traces, then Dimemas uses the obtained traces and reconstructs the application's time-behavior on a configurable parallel platform, and finally Paraver visualizes the obtained time-behaviors. Our simulation methodology brings two new features into the study of overlap: 1) automatic simulation of the overlapped execution - as there is no need for code restructuring in applications; and 2) visualization of simulated time behaviors, that further allows useful comparisons of the non-overlapped and the overlapped executions.","PeriodicalId":152171,"journal":{"name":"2010 IEEE International Conference on Cluster Computing","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116627695","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}