2011 IEEE International Conference on Cluster Computing: Latest Publications

Application I/O and Data Management
2011 IEEE International Conference on Cluster Computing. Pub Date: 2011-09-26. DOI: 10.1109/CLUSTER.2011.84
W. Dai
Abstract: A library called HIO has been developed for large-scale multi-physics simulations, based on the UDM library [10]. The goal of the library is to provide sustainable, interoperable, efficient, scalable, and convenient tools for parallel IO and data management for high-level data structures in applications, and to provide tools for the connection between applications. The high-level data structures include one- and multi-dimensional arrays, structured meshes, unstructured meshes, and meshes generated through adaptive mesh refinement. The IO mechanism can be either collective or non-collective. The data objects suitable for the library can be either large or small data sets. Even for small data sets, the IO performance is close to that of MPI-IO.
Citations: 1
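The abstract does not show HIO's interface, but the collective parallel IO it builds on can be illustrated with plain MPI-IO. Below is a minimal sketch, assuming each rank owns a contiguous slice of a one-dimensional double array; the file name and slice size are illustrative only, not part of the HIO library.

    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        const int n_local = 1024;               /* illustrative slice size */
        double *slice = malloc(n_local * sizeof(double));
        for (int i = 0; i < n_local; i++)
            slice[i] = rank + i * 1e-3;

        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "array.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

        /* Each rank writes its slice at its own offset; the call is collective,
           so the MPI library can aggregate the small per-rank requests. */
        MPI_Offset offset = (MPI_Offset)rank * n_local * sizeof(double);
        MPI_File_write_at_all(fh, offset, slice, n_local, MPI_DOUBLE,
                              MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        free(slice);
        MPI_Finalize();
        return 0;
    }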
Dynamic Load Balance for Optimized Message Logging in Fault Tolerant HPC Applications
2011 IEEE International Conference on Cluster Computing. Pub Date: 2011-09-26. DOI: 10.1109/CLUSTER.2011.39
Esteban Meneses, L. Kalé, G. Bronevetsky
Abstract: Computing systems will grow significantly larger in the near future to satisfy the needs of computational scientists in areas like climate modeling, biophysics and cosmology. Supercomputers being installed in the next few years will comprise millions of cores, hundreds of thousands of processor chips and millions of physical components. However, it is expected that failures will become more prevalent in those machines, to the point where 10% of an Exascale system will be wasted just recovering from failures. Further, with such large numbers of cores, fine-grained and dynamic load balance will become increasingly critical for maintaining good system utilization. This paper addresses both fault tolerance and load balancing by presenting a novel extension of traditional message logging protocols based on team checkpointing. Message logging makes it possible to recover from localized failures by rolling back just the failed processing elements. Since this comes at a high memory overhead from logging all communication, we reduce this cost by organizing processing elements into teams and only logging messages between teams. Further, we show how to dynamically partition the application into teams to simultaneously minimize the cost of fault tolerance and balance application load. We experimentally show that this scheme has low overhead and can dramatically reduce the memory cost of message logging.
Citations: 10
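The core idea of team-based message logging can be sketched in a few lines: a message is logged by the sender only when the destination belongs to a different team, so intra-team traffic incurs no logging memory overhead. A minimal sketch with hypothetical team_of() and log_payload() helpers; this is an illustration of the principle, not the paper's actual protocol implementation, and the static team partition stands in for the dynamic repartitioning the paper proposes.

    #include <stddef.h>

    #define TEAM_SIZE 8   /* illustrative: 8 processing elements per team */

    /* Map a processing element to its team (static blocked partition here;
       the paper repartitions teams dynamically together with the load balancer). */
    static int team_of(int pe) { return pe / TEAM_SIZE; }

    /* Hypothetical sender-side hook called before a message goes out. */
    void maybe_log_message(int src_pe, int dst_pe,
                           const void *payload, size_t len,
                           void (*log_payload)(const void *, size_t))
    {
        /* Only inter-team messages must be logged for recovery; messages
           inside a team are regenerated by rolling back the whole team. */
        if (team_of(src_pe) != team_of(dst_pe))
            log_payload(payload, len);
    }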
Performance Emulation of Cell-Based AMR Cosmology Simulations
2011 IEEE International Conference on Cluster Computing. Pub Date: 2011-09-26. DOI: 10.1109/CLUSTER.2011.10
Jingjin Wu, R. González, Z. Lan, N. Gnedin, A. Kravtsov, D. Rudd, Yongen Yu
Abstract: Cosmological simulations are highly complicated, and it is time-consuming to redesign and reimplement the code for improvement. Moreover, it is risky to implement any idea directly in the code without knowing its effect on performance. In this paper, we design an emulator for cell-based AMR (adaptive mesh refinement) cosmology simulations, in particular the Adaptive Refinement Tree (ART) application. ART is an advanced "hydro+N-body" simulation tool integrating extensive physics processes for cosmological research. The emulator is designed based on the behaviors of cell-based AMR cosmology simulations, and quantitative performance models are built to support its design. Our experiments with realistic cosmology simulations on production supercomputers indicate that the emulator is accurate. Moreover, we evaluate and compare three different load balancing schemes for cell-based cosmology simulations via the emulator. The comparison results provide useful insight into the performance and scalability of the different load balancing schemes.
Citations: 10
Performance Behavior Prediction Scheme for Shared-Memory Parallel Applications
2011 IEEE International Conference on Cluster Computing. Pub Date: 2011-09-26. DOI: 10.1109/CLUSTER.2011.58
J. Corredor, J. Moure, Dolores Rexachs, Daniel Franco, E. Luque
Abstract: A current challenge in computing centers that operate several different clusters is deciding which multicore system should run a given shared-memory parallel application. Our proposal is to generate a node performance profile database (NPPDB), composed of performance profiles obtained from distinct microbenchmark/target-node combinations. Applications are then executed on a base node to identify their different execution phases and the weight of each phase, and to collect performance and functional data for each phase. To make the comparison meaningful, the information used to compare behavior is always obtained on the same base node. When we want to project performance behavior, we look for similarity between the phase characterization and the information in the performance profile database, in order to select the appropriate node for running the application.
Citations: 0
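A minimal sketch of the similarity step described above: each application phase is summarized as a small vector of normalized performance metrics, and the node whose stored profile is closest (Euclidean distance here, purely illustrative) is selected. The profile layout and metric choice are assumptions for illustration, not the authors' exact characterization.

    #include <math.h>

    #define N_FEATURES 4   /* e.g. IPC, cache miss rate, memory bandwidth, ... */

    struct node_profile {
        const char *node_name;
        double features[N_FEATURES];   /* microbenchmark results on that node */
    };

    /* Return the index of the profile closest to the observed phase vector. */
    int select_node(const double phase[N_FEATURES],
                    const struct node_profile *db, int n_nodes)
    {
        int best = 0;
        double best_dist = INFINITY;
        for (int i = 0; i < n_nodes; i++) {
            double d = 0.0;
            for (int f = 0; f < N_FEATURES; f++) {
                double diff = phase[f] - db[i].features[f];
                d += diff * diff;
            }
            if (d < best_dist) { best_dist = d; best = i; }
        }
        return best;
    }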
RDMA Based Replication of Multiprocessor Virtual Machines over High-Performance Interconnects
2011 IEEE International Conference on Cluster Computing. Pub Date: 2011-09-26. DOI: 10.1109/CLUSTER.2011.13
Balazs Gerofi, Y. Ishikawa
Abstract: With the growing prevalence of cloud computing and the increasing number of CPU cores in modern processors, symmetric multiprocessing (SMP) Virtual Machines (VM), i.e. virtual machines with multiple virtual CPUs, are gaining significance. However, accommodating SMP virtual machines with high availability at low overhead is still an open problem. Checkpoint-recovery based VM replication is an emerging approach, but it comes at the price of significant performance degradation of the application executed in the VM, due to the large amount of state that needs to be synchronized between the primary and the backup machines. Advanced features of high performance interconnects, such as Remote Direct Memory Access (RDMA), on the other hand, offer extreme network throughput. As such features may keep the performance degradation acceptable even for multi-core replicated virtual machines, the impact of these technologies on VM replication is important to assess. In this paper, we take a first look at the performance advantages of RDMA for SMP virtual machine replication. Moreover, in order to alleviate VM downtime during replication, we propose fine-grained copy-on-write (COW), which protects only memory pages that need to be transferred to the backup host, allowing the VM to execute concurrently with the replication. We find that the performance of replicated virtual machines over high performance interconnects scales well with the number of vCPUs in multiprocessor virtual machines, and that RDMA based replication in conjunction with fine-grained COW imposes acceptable overhead compared to native VM execution when applied to virtual machines with up to 16 vCPUs.
Citations: 10
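Fine-grained copy-on-write of the kind described above is commonly built on page protection: pages still awaiting transfer to the backup are write-protected, and the first write triggers a fault handler that snapshots the page before re-enabling writes. The sketch below shows that mechanism in user space with mprotect() and SIGSEGV; it is a conceptual illustration under those assumptions, not the hypervisor-level implementation evaluated in the paper, and it glosses over async-signal-safety concerns.

    #define _GNU_SOURCE
    #include <signal.h>
    #include <string.h>
    #include <stdint.h>
    #include <unistd.h>
    #include <sys/mman.h>

    static long page_size;
    static char *region;          /* memory being replicated (page-aligned) */
    static char *shadow;          /* per-page snapshots destined for the backup */

    static void cow_handler(int sig, siginfo_t *info, void *ctx)
    {
        (void)sig; (void)ctx;
        uintptr_t addr = (uintptr_t)info->si_addr;
        uintptr_t page = addr & ~(uintptr_t)(page_size - 1);
        size_t off = page - (uintptr_t)region;

        /* Snapshot the old page contents, then let the faulting write proceed. */
        memcpy(shadow + off, (void *)page, page_size);
        mprotect((void *)page, page_size, PROT_READ | PROT_WRITE);
    }

    void start_epoch(char *mem, char *shadow_buf, size_t len)
    {
        page_size = sysconf(_SC_PAGESIZE);
        region = mem;
        shadow = shadow_buf;

        struct sigaction sa = {0};
        sa.sa_sigaction = cow_handler;
        sa.sa_flags = SA_SIGINFO;
        sigaction(SIGSEGV, &sa, NULL);

        /* Write-protect only the pages that still have to reach the backup. */
        mprotect(mem, len, PROT_READ);
    }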
FastQuery: A Parallel Indexing System for Scientific Data
2011 IEEE International Conference on Cluster Computing. Pub Date: 2011-09-26. DOI: 10.1109/CLUSTER.2011.86
J. Chou, Kesheng Wu, Prabhat
Abstract: Modern scientific datasets present numerous data management and analysis challenges. State-of-the-art index and query technologies such as FastBit can significantly improve accesses to these datasets by augmenting the user data with indexes and other secondary information. However, a challenge is that the indexes assume the relational data model, whereas scientific data generally follows the array data model. To match the two data models, we design a generic mapping mechanism and implement an efficient input and output interface for reading and writing the data and their corresponding indexes. To take advantage of the emerging many-core architectures, we also develop a parallel strategy for indexing using threading technology. This approach complements our on-going MPI-based parallelization efforts. We demonstrate the flexibility of our software by applying it to two of the most commonly used scientific data formats, HDF5 and NetCDF. We present two case studies using data from a particle accelerator model and a global climate model. We also conducted a detailed performance study using these scientific datasets. The results show that FastQuery speeds up the query time by a factor of 2.5x to 50x, and it reduces the indexing time by a factor of 16 on 24 cores.
Citations: 50
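The mapping problem the abstract mentions, relational-style indexes over array data, comes down to treating each array cell's linearized offset as its row identifier. A minimal sketch under simplified assumptions: a 2-D array and a threaded scan (OpenMP) that records the offsets of cells satisfying a query predicate. FastBit's actual bitmap indexes and the FastQuery interface are far more sophisticated than this.

    #include <stddef.h>
    #include <omp.h>

    /* Linearize a 2-D array coordinate into a "row id" for the index. */
    static inline size_t row_id(size_t i, size_t j, size_t ncols)
    {
        return i * ncols + j;
    }

    /* Threaded scan: collect row ids of cells with value above a threshold.
       Returns the number of hits written into 'hits' (capacity 'max_hits'). */
    size_t query_gt(const double *data, size_t nrows, size_t ncols,
                    double threshold, size_t *hits, size_t max_hits)
    {
        size_t count = 0;
        #pragma omp parallel for schedule(static)
        for (size_t i = 0; i < nrows; i++) {
            for (size_t j = 0; j < ncols; j++) {
                if (data[row_id(i, j, ncols)] > threshold) {
                    size_t pos;
                    #pragma omp atomic capture
                    pos = count++;
                    if (pos < max_hits)
                        hits[pos] = row_id(i, j, ncols);
                }
            }
        }
        return count > max_hits ? max_hits : count;
    }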
Locality-Aware Parallel Process Mapping for Multi-core HPC Systems
2011 IEEE International Conference on Cluster Computing. Pub Date: 2011-09-26. DOI: 10.1109/CLUSTER.2011.59
Joshua Hursey, J. Squyres, T. Dontje
Abstract: High Performance Computing (HPC) systems are composed of servers containing an ever-increasing number of cores. With such high processor core counts, non-uniform memory access (NUMA) architectures are almost universally used to reduce inter-processor and memory communication bottlenecks by distributing processors and memory throughout a server-internal networking topology. Application studies have shown that tuning the placement of processes in a server's NUMA topology to the application can have a dramatic impact on performance. The performance implications are magnified when running a parallel job across multiple server nodes, especially with large scale HPC applications. This paper presents the Locality-Aware Mapping Algorithm (LAMA) for distributing the individual processes of a parallel application across processing resources in an HPC system, paying particular attention to the internal server NUMA topologies. The algorithm is able to support both homogeneous and heterogeneous hardware systems, and dynamically adapts to the available hardware and user-specified process layout at run-time. As implemented in Open MPI, the LAMA provides 362,880 mapping permutations and is able to naturally scale out to additional hardware resources as they become available in future architectures.
Citations: 29
Supporting Computing Element Heterogeneity in P2P Grids
2011 IEEE International Conference on Cluster Computing. Pub Date: 2011-09-26. DOI: 10.1109/CLUSTER.2011.25
J. Lee, P. Keleher, A. Sussman
Abstract: We propose resource discovery and load balancing techniques to accommodate computing nodes with many types of computing elements, such as multi-core CPUs and GPUs, in a peer-to-peer desktop grid architecture. Heterogeneous nodes can have multiple types of computing elements, and the performance and characteristics of each computing element can be very different. Our scheme takes into account these diverse aspects of heterogeneous nodes to maximize overall system throughput. However, straightforward methods of handling diverse computing elements that differ on many axes can result in high overheads, both in local state and in communication volume. We describe approaches that minimize messaging costs without sacrificing the failure resilience provided by an underlying peer-to-peer overlay network. Simulation results show that our scheme's load balancing performance is comparable to that of a centralized approach, that communication costs are reduced significantly compared to the existing system, and that failure resilience is not compromised.
Citations: 4
Parallel I/O Performance for Application-Level Checkpointing on the Blue Gene/P System
2011 IEEE International Conference on Cluster Computing. Pub Date: 2011-09-26. DOI: 10.1109/CLUSTER.2011.81
Jing Fu, M. Min, R. Latham, C. Carothers
Abstract: As the number of processors increases to hundreds of thousands in parallel computer architectures, the failure probability rises correspondingly, making fault tolerance a highly important and challenging task. Application-level checkpointing is one of the most popular techniques to proactively deal with unexpected failures because of its portability and flexibility. During the checkpoint phase, the local states of the computation spread across thousands of processors are saved to stable storage. Unfortunately, this approach results in heavy I/O load and can cause an I/O bottleneck in a massively parallel system. In this paper, we examine application-level checkpointing for a massively parallel electromagnetic solver system called NekCEM on the IBM Blue Gene/P at Argonne National Laboratory. We discuss an application-level, two-phase I/O approach called "reduced-blocking I/O" (rbIO) and a tuned MPI-IO collective approach (coIO), and we demonstrate their performance advantage over the "1 POSIX file per processor" approach. Our study shows that rbIO and coIO result in a 100x improvement over previous checkpointing approaches on up to 65,536 processors of the Blue Gene/P using GPFS. Our study also demonstrates a 25x production performance improvement for NekCEM. We show how to optimize parameter settings for these parallel I/O approaches and verify the results through I/O profiling. In particular, we examine the performance advantage of rbIO and demonstrate the potential benefits of this approach over the traditional MPI-IO routine, coIO.
Citations: 16
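The reduced-blocking, two-phase idea, gathering many ranks' checkpoint data onto a few aggregator ranks and letting only those ranks touch the file system, can be sketched with plain MPI. The aggregation factor, file names and data layout below are illustrative assumptions, not the NekCEM or rbIO configuration.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define AGG_FACTOR 32   /* illustrative: one writer per 32 ranks */

    void checkpoint(const double *local, int n_local, int step)
    {
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Split ranks into groups of AGG_FACTOR; group rank 0 is the writer. */
        MPI_Comm group;
        MPI_Comm_split(MPI_COMM_WORLD, rank / AGG_FACTOR, rank, &group);
        int grank, gsize;
        MPI_Comm_rank(group, &grank);
        MPI_Comm_size(group, &gsize);

        double *buf = NULL;
        if (grank == 0)
            buf = malloc((size_t)gsize * n_local * sizeof(double));

        /* Phase 1: aggregate checkpoint data onto the group's writer rank. */
        MPI_Gather(local, n_local, MPI_DOUBLE,
                   buf, n_local, MPI_DOUBLE, 0, group);

        /* Phase 2: only the aggregators perform file I/O (one file per group),
           so far fewer clients hit the parallel file system. */
        if (grank == 0) {
            char name[64];
            snprintf(name, sizeof(name), "ckpt_%d_grp%d.dat",
                     step, rank / AGG_FACTOR);
            FILE *f = fopen(name, "wb");
            fwrite(buf, sizeof(double), (size_t)gsize * n_local, f);
            fclose(f);
            free(buf);
        }
        MPI_Comm_free(&group);
    }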
Optimized Non-contiguous MPI Datatype Communication for GPU Clusters: Design, Implementation and Evaluation with MVAPICH2
2011 IEEE International Conference on Cluster Computing. Pub Date: 2011-09-26. DOI: 10.1109/CLUSTER.2011.42
Hao Wang, S. Potluri, Miao Luo, A. Singh, Xiangyong Ouyang, S. Sur, D. Panda
Abstract: Data parallel architectures, such as General Purpose Graphics Processing Units (GPGPUs), have seen a tremendous rise in their application for High End Computing. However, data movement in and out of GPGPUs remains the biggest hurdle to overall performance and programmer productivity. Real scientific applications utilize multi-dimensional data, and data in higher dimensions may not be contiguous in memory. In order to improve programmer productivity and to enable communication libraries to optimize non-contiguous data communication, the MPI interface provides MPI datatypes. Currently, state-of-the-art MPI libraries do not provide native datatype support for data that resides in GPU memory. The management of non-contiguous GPU data is a source of productivity and performance loss, because GPU application developers have to manually move the data out of and into GPUs. In this paper, we present our design for enabling high-performance communication support between GPUs for non-contiguous datatypes. We describe our innovative approach to improve performance by "offloading" datatype packing and unpacking onto a GPU device, and "pipelining" all data transfer stages between two GPUs. Our design is integrated into the popular MVAPICH2 MPI library for InfiniBand, iWARP and RoCE clusters. We perform a detailed evaluation of our design on a GPU cluster with the latest NVIDIA Fermi GPU adapters. The evaluation reveals that the proposed designs can achieve up to 88% latency improvement for the vector datatype at 4 MB size with micro-benchmarks. For the Stencil2D application from the SHOC benchmark suite, our design can simplify the data communication in its main loop, reducing the lines of code by 36%. Further, our method can improve the performance of Stencil2D by up to 42% for the single precision data set, and 39% for the double precision data set. To the best of our knowledge, this is the first such design, implementation and evaluation of non-contiguous MPI data communication for GPU clusters.
Citations: 55
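What the datatype support buys the application can be shown with an ordinary MPI vector datatype describing a strided column; with a design like the one above, the same calls can be issued directly on GPU-resident buffers instead of staging through host copies. The sketch below uses host memory and hypothetical dimensions; it illustrates the datatype mechanics only, not the MVAPICH2-internal pack/unpack offload or pipelining.

    #include <mpi.h>

    /* Exchange one column of an N x N row-major matrix between ranks 0 and 1.
       The column is non-contiguous: N blocks of 1 double with stride N. */
    void exchange_column(double *matrix, int N, int col)
    {
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Datatype column;
        MPI_Type_vector(N, 1, N, MPI_DOUBLE, &column);
        MPI_Type_commit(&column);

        /* With CUDA-aware MPI (as in the paper's MVAPICH2 design), 'matrix'
           could be a device pointer and the pack/unpack would run on the GPU. */
        if (rank == 0)
            MPI_Send(&matrix[col], 1, column, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(&matrix[col], 1, column, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);

        MPI_Type_free(&column);
    }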