{"title":"Designing OS for HPC Applications: Scheduling","authors":"R. Gioiosa, S. Mckee, M. Valero","doi":"10.1109/CLUSTER.2010.16","DOIUrl":"https://doi.org/10.1109/CLUSTER.2010.16","url":null,"abstract":"Operating systems have historically been implemented as independent layers between hardware and applications. User programs communicate with the OS through a set of well defined system calls, and do not have direct access to the hardware. The OS, in turn, communicates with the underlying architecture via control registers. Except for these interfaces, the three layers are practically oblivious to each other. While this structure improves portability and transparency, it may not deliver optimal performance. This is especially true for High Performance Computing (HPC) systems, where modern parallel applications and multi-core architectures pose new challenges in terms of performance, power consumption, and system utilization. The hardware, the OS, and the applications can no longer remain isolated, and instead should cooperate to deliver high performance with minimal power consumption. In this paper we present our experience with the design and implementation of High Performance Linux (HPL), an operating system designed to optimize the performance of HPC applications running on a state-of-the-art compute cluster. We show how characterizing parallel applications through hardware and software performance counters drives the design of the OS and how including knowledge about the architecture improves performance and efficiency. We perform experiments on a dual-socket IBM POWER6 machine, showing performance improvements and stability (performance variation of 2.11% on average) for NAS, a widely used parallel benchmark suite.","PeriodicalId":152171,"journal":{"name":"2010 IEEE International Conference on Cluster Computing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130548494","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: Acceleration of Streamed Tensor Contraction Expressions on GPGPU-Based Clusters
Authors: Wenjing Ma, S. Krishnamoorthy, Oreste Villa, K. Kowalski
Venue: 2010 IEEE International Conference on Cluster Computing, 2010-09-20
DOI: https://doi.org/10.1109/CLUSTER.2010.26
Abstract: Tensor contractions are generalized multidimensional matrix multiplication operations that occur widely in quantum chemistry. Efficient execution of tensor contractions on GPUs requires tackling several challenges, including index permutation and small dimension sizes that reduce thread-block utilization. In this paper, we present our approach to automatically generating CUDA code to execute tensor contractions on GPUs, including management of data movement between CPU and GPU. GPU-enabled code is generated for the most expensive contractions in CCSD(T), a key coupled-cluster method, and incorporated into NWChem, a popular computational chemistry suite. We demonstrate a speedup of over 8.4x when using one core per node, and of over 2.6x when utilizing the entire system, with a hybrid CPU+GPU solution using 2 GPUs and 5 cores. Finally, we analyze the behavior of the implementation on future GPU systems.
Title: How to Scale Nested OpenMP Applications on the ScaleMP vSMP Architecture
Authors: Dirk Schmidl, C. Terboven, A. Wolf, Dieter an Mey, C. Bischof
Venue: 2010 IEEE International Conference on Cluster Computing, 2010-09-20
DOI: https://doi.org/10.1109/CLUSTER.2010.38
Abstract: The novel ScaleMP vSMP architecture employs commodity x86-based servers with an InfiniBand network to assemble a large shared-memory system at an attractive price point. We examine this combined hardware and software approach to a DSM system using both system-level kernel benchmarks and real-world application codes. We compare this architecture with traditional shared-memory machines and elaborate on strategies to tune application codes parallelized with OpenMP on multiple levels. Finally, we summarize the conditions a scalable application has to fulfill in order to profit from the full potential of the ScaleMP approach.
Title: RDMA-Based Job Migration Framework for MPI over InfiniBand
Authors: Xiangyong Ouyang, Sonya Marcarelli, R. Rajachandrasekar, D. Panda
Venue: 2010 IEEE International Conference on Cluster Computing, 2010-09-20
DOI: https://doi.org/10.1109/CLUSTER.2010.20
Abstract: Coordinated checkpoint and recovery is a common approach to achieving fault tolerance on large-scale systems. The traditional mechanism dumps the process images of all the processes involved in the parallel job to a local disk or a central storage area. When a failure occurs, the processes are restarted and restored from the latest checkpoint image. However, this approach cannot provide the scalability required by increasingly large jobs, since it puts a heavy I/O burden on the storage subsystem, and resubmitting a job during the restart phase incurs a lengthy queuing delay. In this paper, we enhance the fault tolerance of MVAPICH2, an open-source high-performance MPI-2 implementation, with a proactive job migration scheme. Instead of checkpointing all the processes of the job and saving their process images to stable storage, we transfer the processes running on a health-deteriorating node to a healthy spare node and resume them there. RDMA-based process image transmission is designed to take advantage of InfiniBand's high-performance communication. Experimental results show that the job migration scheme achieves a 4.49x speedup over the checkpoint/restart scheme in handling a node failure for a 64-process application running on 8 compute nodes. To the best of our knowledge, this is the first such job migration design for InfiniBand-based clusters.
{"title":"TCCluster: A Cluster Architecture Utilizing the Processor Host Interface as a Network Interconnect","authors":"Heiner Litz, M. Thürmer, U. Brüning","doi":"10.1109/CLUSTER.2010.37","DOIUrl":"https://doi.org/10.1109/CLUSTER.2010.37","url":null,"abstract":"So far, large computing clusters consisting of several thousand machines have been constructed by connecting nodes together using interconnect technologies as e.g. Ethernet, Infiniband or Myrinet. We propose an entirely new architecture called Tightly Coupled Cluster (TCCluster) that instead uses the native host interface of the processors as a direct network interconnect. This approach offers higher bandwidth and much lower communication latencies than the traditional approaches by virtually integrating the network interface adapter into the processor. Our technique neither applies any modifications to the processor nor requires any additional hardware. Instead, we use commodity off the shelf AMD processors and exploit the HyperTransport host interface as a cluster interconnect. Our approach is purely software based and does not require any additional hardware nor modifications to the existing processors. In this paper, we explain the addressing of nodes in such a cluster, the routing within such a system and the programming model that can be applied. We present a detailed description of the tasks that need to be addressed and provide a proof of concept implementation. For the evaluation of our technique a two node TCCluster prototype is presented. Therefore, the BIOS firmware, a custom Linux kernel and a small message library has been developed. We present microbenchmarks that show a sustained bandwidth of up to 2500 MB/s for messages as small as 64 Byte and a communication latency of 227 ns between two nodes outperforming other high performance networks by an order of magnitude.","PeriodicalId":152171,"journal":{"name":"2010 IEEE International Conference on Cluster Computing","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116855301","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: Adaptive Optimization for Petascale Heterogeneous CPU/GPU Computing
Authors: Canqun Yang, Feng Wang, Yunfei Du, Juan Chen, Jie Liu, Huizhan Yi, Kai Lu
Venue: 2010 IEEE International Conference on Cluster Computing, 2010-09-20
DOI: https://doi.org/10.1109/CLUSTER.2010.12
Abstract: In this paper, we describe our experience developing an implementation of the Linpack benchmark for TianHe-1, a petascale CPU/GPU supercomputer and the largest GPU-accelerated system attempted to date. An adaptive optimization framework is presented that balances the workload distribution across the GPUs and CPUs with negligible runtime overhead, resulting in better performance than static or training-based partitioning methods. The CPU-GPU communication overhead is effectively hidden by a software pipelining technique, which is particularly useful for large memory-bound applications. Combined with other traditional optimizations, the Linpack implementation optimized with the adaptive framework achieved 196.7 GFLOPS on a single compute element of TianHe-1. This result is 70.1% of the peak compute capability and 3.3 times faster than the result obtained with the vendor's library. On the full configuration of TianHe-1, our optimizations resulted in a Linpack performance of 0.563 PFLOPS, which made TianHe-1 the 5th fastest supercomputer on the Top500 list released in November 2009.
{"title":"An Efficient Process Live Migration Mechanism for Load Balanced Distributed Virtual Environments","authors":"Balazs Gerofi, H. Fujita, Y. Ishikawa","doi":"10.1109/CLUSTER.2010.25","DOIUrl":"https://doi.org/10.1109/CLUSTER.2010.25","url":null,"abstract":"Distributed virtual environments (DVE), such as multi-player online games and distributed simulations may involve a massive amount of concurrent clients. Deploying distributed server architectures is currently the most prevalent way of providing such large-scale services, where typically the virtual space is divided into several distinct regions requiring each server to handle only part of the virtual world. Inequalities in client distribution may, however, cause certain servers to become overloaded, which potentially degrades the interactivity of the environment and thus renders the load balancing problem a crucial issue. Prior research has shown several approaches for avoiding uneven workload, nevertheless, addressing the problem mainly at the application layer. In this paper we focus on solving the DVE load balancing problem at the operating system level. We propose an efficient process live migration mechanism, which is optimized for processes maintaining a massive amount of network connections. Building on top of it, we have implemented a decentralized middleware that instruments process migration among the cluster nodes, attempting to equalize loads on all machines. We demonstrate the performance of the live migration mechanism on a real-world multiplayer game server and show the behavior of the load balancing engine through a realistic DVE simulation.","PeriodicalId":152171,"journal":{"name":"2010 IEEE International Conference on Cluster Computing","volume":"140 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132017540","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Exploiting Data Deduplication to Accelerate Live Virtual Machine Migration","authors":"Xiang Zhang, Zhigang Huo, Jie Ma, Dan Meng","doi":"10.1109/CLUSTER.2010.17","DOIUrl":"https://doi.org/10.1109/CLUSTER.2010.17","url":null,"abstract":"As one of the key characteristics of virtualization, live virtual machine (VM) migration provides great benefits for load balancing, power management, fault tolerance and other system maintenance issues in modern clusters and data centers. Although Pre-Copy is a widespread used migration algorithm, it does transfer a lot of duplicated memory image data from source to destination, which results in longer migration time and downtime. This paper proposes a novel VM migration approach, named Migration with Data Deduplication (MDD), which introduces data deduplication into migration. MDD utilizes the self-similarity of run-time memory image, uses hash based fingerprints to find identical and similar memory pages, and employs Run Length Encode (RLE) to eliminate redundant memory data during migration. Experiment demonstrates that compared with Xen's default Pre-Copy migration algorithm, MDD can reduce 56.60% of total data transferred during migration, 34.93% of total migration time, and 26.16% of downtime on average.","PeriodicalId":152171,"journal":{"name":"2010 IEEE International Conference on Cluster Computing","volume":"63 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132882065","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: Host Side Dynamic Reconfiguration with InfiniBand
Authors: Wei Lin Guay, Sven-Arne Reinemo, Olav Lysne, T. Skeie, Bjørn Dag Johnsen, Line Holen
Venue: 2010 IEEE International Conference on Cluster Computing, 2010-09-20
DOI: https://doi.org/10.1109/CLUSTER.2010.21
Abstract: Rerouting around faulty components and migrating jobs both require reconfiguration of data structures in the Queue Pairs residing in the hosts of an InfiniBand cluster. In this paper, we report an implementation of dynamic reconfiguration of such host-side data structures. Our implementation preserves the Queue Pairs and lets the application run without interruption. With this implementation, we demonstrate a complete solution to fault tolerance in an InfiniBand network, where dynamic network reconfiguration to a topology-agnostic routing function is used to avoid malfunctioning components. This solution is in principle able to let applications run uninterrupted on the cluster as long as the topology remains physically connected. Through measurements on our test cluster, we show that the setup-latency cost of our method is negligible and that there is only a minor reduction in throughput during reconfiguration.
{"title":"A Simulation Framework to Automatically Analyze the Communication-Computation Overlap in Scientific Applications","authors":"V. Subotic, J. Sancho, J. Labarta, M. Valero","doi":"10.1109/CLUSTER.2010.33","DOIUrl":"https://doi.org/10.1109/CLUSTER.2010.33","url":null,"abstract":"Overlapping communication and computation has been devised as an attractive technique to alleviate the huge application's network requirements at large scale. Overlapping will allow to fully or partially hide the long communication delays suffered when transferring messages through the network. This will relax the application's network requirements, and hence allow to deploy more cost-effective network designs. However, today's scientific applications make little use of overlapping. In addition, there is no support to analyze how overlap could impact the performance of real scientific applications. In this paper we address this issue by presenting a simulation framework to automatically analyze the benefits of communication-computation overlap. The simulation framework consists of a binary translation tool (Valgrind), a distributed machine simulator (Dimemas), and a visualization tool (Paraver). Valgrind instruments the legacy MPI application and generates the execution traces, then Dimemas uses the obtained traces and reconstructs the application's time-behavior on a configurable parallel platform, and finally Paraver visualizes the obtained time-behaviors. Our simulation methodology brings two new features into the study of overlap: 1) automatic simulation of the overlapped execution - as there is no need for code restructuring in applications; and 2) visualization of simulated time behaviors, that further allows useful comparisons of the non-overlapped and the overlapped executions.","PeriodicalId":152171,"journal":{"name":"2010 IEEE International Conference on Cluster Computing","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116627695","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}