2011 IEEE International Conference on Cluster Computing: Latest Publications

Application I/O and Data Management
2011 IEEE International Conference on Cluster Computing. Pub Date: 2011-09-26. DOI: 10.1109/CLUSTER.2011.84
W. Dai
Abstract: A library called HIO has been developed for large-scale multi-physics simulations, based on the UDM library [10]. The goal of the library is to provide sustainable, interoperable, efficient, scalable, and convenient tools for parallel IO and data management for high-level data structures in applications, and to provide tools for the connection between applications. The high-level data structures include one- and multi-dimensional arrays, structured meshes, unstructured meshes, and meshes generated through adaptive mesh refinement. The IO mechanism can be either collective or non-collective. The data objects suitable for the library can be either large or small data sets. Even for small data sets, the IO performance is close to that of MPI-IO.
Citations: 1
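The abstract does not show HIO's interface, but the collective parallel IO it builds on can be illustrated with plain MPI-IO. Below is a minimal sketch, assuming each rank owns a contiguous slice of a one-dimensional double array; the file name and slice size are illustrative only, not part of the HIO library.

    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        const int n_local = 1024;               /* illustrative slice size */
        double *slice = malloc(n_local * sizeof(double));
        for (int i = 0; i < n_local; i++)
            slice[i] = rank + i * 1e-3;

        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "array.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

        /* Each rank writes its slice at its own offset; the call is collective,
           so the MPI library can aggregate the small per-rank requests. */
        MPI_Offset offset = (MPI_Offset)rank * n_local * sizeof(double);
        MPI_File_write_at_all(fh, offset, slice, n_local, MPI_DOUBLE,
                              MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        free(slice);
        MPI_Finalize();
        return 0;
    }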
Dynamic Load Balance for Optimized Message Logging in Fault Tolerant HPC Applications
2011 IEEE International Conference on Cluster Computing. Pub Date: 2011-09-26. DOI: 10.1109/CLUSTER.2011.39
Esteban Meneses, L. Kalé, G. Bronevetsky
Abstract: Computing systems will grow significantly larger in the near future to satisfy the needs of computational scientists in areas like climate modeling, biophysics and cosmology. Supercomputers being installed in the next few years will comprise millions of cores, hundreds of thousands of processor chips and millions of physical components. However, it is expected that failures will become more prevalent in those machines, to the point where 10% of an Exascale system will be wasted just recovering from failures. Further, with such large numbers of cores, fine-grained and dynamic load balance will become increasingly critical for maintaining good system utilization. This paper addresses both fault tolerance and load balancing by presenting a novel extension of traditional message logging protocols based on team checkpointing. Message logging makes it possible to recover from localized failures by rolling back just the failed processing elements. Since this comes at a high memory overhead from logging all communication, we reduce this cost by organizing processing elements into teams and only logging messages between teams. Further, we show how to dynamically partition the application into teams to simultaneously minimize the cost of fault tolerance and balance application load. We experimentally show that this scheme has low overhead and can dramatically reduce the memory cost of message logging.
Citations: 10
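The core idea of team-based message logging can be sketched in a few lines: a message is logged by the sender only when the destination belongs to a different team, so intra-team traffic incurs no logging memory overhead. A minimal sketch with hypothetical team_of() and log_payload() helpers; this is an illustration of the principle, not the paper's actual protocol implementation, and the static team partition stands in for the dynamic repartitioning the paper proposes.

    #include <stddef.h>

    #define TEAM_SIZE 8   /* illustrative: 8 processing elements per team */

    /* Map a processing element to its team (static blocked partition here;
       the paper repartitions teams dynamically together with the load balancer). */
    static int team_of(int pe) { return pe / TEAM_SIZE; }

    /* Hypothetical sender-side hook called before a message goes out. */
    void maybe_log_message(int src_pe, int dst_pe,
                           const void *payload, size_t len,
                           void (*log_payload)(const void *, size_t))
    {
        /* Only inter-team messages must be logged for recovery; messages
           inside a team are regenerated by rolling back the whole team. */
        if (team_of(src_pe) != team_of(dst_pe))
            log_payload(payload, len);
    }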
Performance Emulation of Cell-Based AMR Cosmology Simulations
2011 IEEE International Conference on Cluster Computing. Pub Date: 2011-09-26. DOI: 10.1109/CLUSTER.2011.10
Jingjin Wu, R. González, Z. Lan, N. Gnedin, A. Kravtsov, D. Rudd, Yongen Yu
Abstract: Cosmological simulations are highly complicated, and it is time-consuming to redesign and reimplement the code for improvement. Moreover, it is risky to implement any idea directly in the code without knowing its effect on performance. In this paper, we design an emulator for cell-based AMR (adaptive mesh refinement) cosmology simulations, in particular the Adaptive Refinement Tree (ART) application. ART is an advanced "hydro+N-body" simulation tool integrating extensive physics processes for cosmological research. The emulator is designed based on the behaviors of cell-based AMR cosmology simulations, and quantitative performance models are built to support its design. Our experiments with realistic cosmology simulations on production supercomputers indicate that the emulator is accurate. Moreover, we evaluate and compare three different load balancing schemes for cell-based cosmology simulations via the emulator. The comparison results provide useful insight into the performance and scalability of the different load balancing schemes.
Citations: 10
Performance Behavior Prediction Scheme for Shared-Memory Parallel Applications
2011 IEEE International Conference on Cluster Computing. Pub Date: 2011-09-26. DOI: 10.1109/CLUSTER.2011.58
J. Corredor, J. Moure, Dolores Rexachs, Daniel Franco, E. Luque
Abstract: A current challenge in computing centers that operate several different clusters is deciding which multicore system should run a given shared-memory parallel application. Our proposal is to generate a node performance profile database (NPPDB), composed of performance profiles obtained from distinct microbenchmark/target-node combinations. Applications are then executed on a base node to identify their different execution phases and the weight of each phase, and to collect performance and functional data for each phase. To make the comparison meaningful, the information used to compare behavior is always obtained on the same base node. When we want to project performance behavior, we look for similarity between the phase characterization and the information in the performance profile database, in order to select the appropriate node for running the application.
Citations: 0
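A minimal sketch of the similarity step described above: each application phase is summarized as a small vector of normalized performance metrics, and the node whose stored profile is closest (Euclidean distance here, purely illustrative) is selected. The profile layout and metric choice are assumptions for illustration, not the authors' exact characterization.

    #include <math.h>

    #define N_FEATURES 4   /* e.g. IPC, cache miss rate, memory bandwidth, ... */

    struct node_profile {
        const char *node_name;
        double features[N_FEATURES];   /* microbenchmark results on that node */
    };

    /* Return the index of the profile closest to the observed phase vector. */
    int select_node(const double phase[N_FEATURES],
                    const struct node_profile *db, int n_nodes)
    {
        int best = 0;
        double best_dist = INFINITY;
        for (int i = 0; i < n_nodes; i++) {
            double d = 0.0;
            for (int f = 0; f < N_FEATURES; f++) {
                double diff = phase[f] - db[i].features[f];
                d += diff * diff;
            }
            if (d < best_dist) { best_dist = d; best = i; }
        }
        return best;
    }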
RDMA Based Replication of Multiprocessor Virtual Machines over High-Performance Interconnects
2011 IEEE International Conference on Cluster Computing. Pub Date: 2011-09-26. DOI: 10.1109/CLUSTER.2011.13
Balazs Gerofi, Y. Ishikawa
Abstract: With the growing prevalence of cloud computing and the increasing number of CPU cores in modern processors, symmetric multiprocessing (SMP) Virtual Machines (VM), i.e. virtual machines with multiple virtual CPUs, are gaining significance. However, accommodating SMP virtual machines with high availability at low overhead is still an open problem. Checkpoint-recovery based VM replication is an emerging approach, but it comes at the price of significant performance degradation of the application executed in the VM, due to the large amount of state that needs to be synchronized between the primary and the backup machines. Advanced features of high performance interconnects, such as Remote Direct Memory Access (RDMA), on the other hand, offer extreme network throughput. As such features may keep the performance degradation acceptable even for multi-core replicated virtual machines, the impact of these technologies on VM replication is important to assess. In this paper, we take a first look at the performance advantages of RDMA for SMP virtual machine replication. Moreover, in order to alleviate VM downtime during replication, we propose fine-grained copy-on-write (COW), which protects only memory pages that need to be transferred to the backup host, allowing the VM to execute concurrently with the replication. We find that the performance of replicated virtual machines over high performance interconnects scales well with the number of vCPUs in multiprocessor virtual machines, and that RDMA based replication in conjunction with fine-grained COW imposes acceptable overhead compared to native VM execution when applied to virtual machines with up to 16 vCPUs.
Citations: 10
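Fine-grained copy-on-write of the kind described above is commonly built on page protection: pages still awaiting transfer to the backup are write-protected, and the first write triggers a fault handler that snapshots the page before re-enabling writes. The sketch below shows that mechanism in user space with mprotect() and SIGSEGV; it is a conceptual illustration under those assumptions, not the hypervisor-level implementation evaluated in the paper, and it glosses over async-signal-safety concerns.

    #define _GNU_SOURCE
    #include <signal.h>
    #include <string.h>
    #include <stdint.h>
    #include <unistd.h>
    #include <sys/mman.h>

    static long page_size;
    static char *region;          /* memory being replicated (page-aligned) */
    static char *shadow;          /* per-page snapshots destined for the backup */

    static void cow_handler(int sig, siginfo_t *info, void *ctx)
    {
        (void)sig; (void)ctx;
        uintptr_t addr = (uintptr_t)info->si_addr;
        uintptr_t page = addr & ~(uintptr_t)(page_size - 1);
        size_t off = page - (uintptr_t)region;

        /* Snapshot the old page contents, then let the faulting write proceed. */
        memcpy(shadow + off, (void *)page, page_size);
        mprotect((void *)page, page_size, PROT_READ | PROT_WRITE);
    }

    void start_epoch(char *mem, char *shadow_buf, size_t len)
    {
        page_size = sysconf(_SC_PAGESIZE);
        region = mem;
        shadow = shadow_buf;

        struct sigaction sa = {0};
        sa.sa_sigaction = cow_handler;
        sa.sa_flags = SA_SIGINFO;
        sigaction(SIGSEGV, &sa, NULL);

        /* Write-protect only the pages that still have to reach the backup. */
        mprotect(mem, len, PROT_READ);
    }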
FastQuery: A Parallel Indexing System for Scientific Data
2011 IEEE International Conference on Cluster Computing. Pub Date: 2011-09-26. DOI: 10.1109/CLUSTER.2011.86
J. Chou, Kesheng Wu, Prabhat
Abstract: Modern scientific datasets present numerous data management and analysis challenges. State-of-the-art index and query technologies such as FastBit can significantly improve accesses to these datasets by augmenting the user data with indexes and other secondary information. However, a challenge is that the indexes assume the relational data model, whereas scientific data generally follows the array data model. To match the two data models, we design a generic mapping mechanism and implement an efficient input and output interface for reading and writing the data and their corresponding indexes. To take advantage of the emerging many-core architectures, we also develop a parallel strategy for indexing using threading technology. This approach complements our on-going MPI-based parallelization efforts. We demonstrate the flexibility of our software by applying it to two of the most commonly used scientific data formats, HDF5 and NetCDF. We present two case studies using data from a particle accelerator model and a global climate model. We also conducted a detailed performance study using these scientific datasets. The results show that FastQuery speeds up the query time by a factor of 2.5x to 50x, and it reduces the indexing time by a factor of 16 on 24 cores.
Citations: 50
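The mapping problem the abstract mentions, relational-style indexes over array data, comes down to treating each array cell's linearized offset as its row identifier. A minimal sketch under simplified assumptions: a 2-D array and a threaded scan (OpenMP) that records the offsets of cells satisfying a query predicate. FastBit's actual bitmap indexes and the FastQuery interface are far more sophisticated than this.

    #include <stddef.h>
    #include <omp.h>

    /* Linearize a 2-D array coordinate into a "row id" for the index. */
    static inline size_t row_id(size_t i, size_t j, size_t ncols)
    {
        return i * ncols + j;
    }

    /* Threaded scan: collect row ids of cells with value above a threshold.
       Returns the number of hits written into 'hits' (capacity 'max_hits'). */
    size_t query_gt(const double *data, size_t nrows, size_t ncols,
                    double threshold, size_t *hits, size_t max_hits)
    {
        size_t count = 0;
        #pragma omp parallel for schedule(static)
        for (size_t i = 0; i < nrows; i++) {
            for (size_t j = 0; j < ncols; j++) {
                if (data[row_id(i, j, ncols)] > threshold) {
                    size_t pos;
                    #pragma omp atomic capture
                    pos = count++;
                    if (pos < max_hits)
                        hits[pos] = row_id(i, j, ncols);
                }
            }
        }
        return count > max_hits ? max_hits : count;
    }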
Locality-Aware Parallel Process Mapping for Multi-core HPC Systems
2011 IEEE International Conference on Cluster Computing. Pub Date: 2011-09-26. DOI: 10.1109/CLUSTER.2011.59
Joshua Hursey, J. Squyres, T. Dontje
Abstract: High Performance Computing (HPC) systems are composed of servers containing an ever-increasing number of cores. With such high processor core counts, non-uniform memory access (NUMA) architectures are almost universally used to reduce inter-processor and memory communication bottlenecks by distributing processors and memory throughout a server-internal networking topology. Application studies have shown that tuning the placement of processes in a server's NUMA topology to the application can have a dramatic impact on performance. The performance implications are magnified when running a parallel job across multiple server nodes, especially with large scale HPC applications. This paper presents the Locality-Aware Mapping Algorithm (LAMA) for distributing the individual processes of a parallel application across processing resources in an HPC system, paying particular attention to the internal server NUMA topologies. The algorithm is able to support both homogeneous and heterogeneous hardware systems, and dynamically adapts to the available hardware and user-specified process layout at run-time. As implemented in Open MPI, the LAMA provides 362,880 mapping permutations and is able to naturally scale out to additional hardware resources as they become available in future architectures.
Citations: 29
Supporting Computing Element Heterogeneity in P2P Grids
2011 IEEE International Conference on Cluster Computing. Pub Date: 2011-09-26. DOI: 10.1109/CLUSTER.2011.25
J. Lee, P. Keleher, A. Sussman
Abstract: We propose resource discovery and load balancing techniques to accommodate computing nodes with many types of computing elements, such as multi-core CPUs and GPUs, in a peer-to-peer desktop grid architecture. Heterogeneous nodes can have multiple types of computing elements, and the performance and characteristics of each computing element can be very different. Our scheme takes into account these diverse aspects of heterogeneous nodes to maximize overall system throughput. However, straightforward methods of handling diverse computing elements that differ on many axes can result in high overheads, both in local state and in communication volume. We describe approaches that minimize messaging costs without sacrificing the failure resilience provided by an underlying peer-to-peer overlay network. Simulation results show that our scheme's load balancing performance is comparable to that of a centralized approach, that communication costs are reduced significantly compared to the existing system, and that failure resilience is not compromised.
Citations: 4
Parallel I/O Performance for Application-Level Checkpointing on the Blue Gene/P System
2011 IEEE International Conference on Cluster Computing. Pub Date: 2011-09-26. DOI: 10.1109/CLUSTER.2011.81
Jing Fu, M. Min, R. Latham, C. Carothers
Abstract: As the number of processors increases to hundreds of thousands in parallel computer architectures, the failure probability rises correspondingly, making fault tolerance a highly important and challenging task. Application-level checkpointing is one of the most popular techniques to proactively deal with unexpected failures because of its portability and flexibility. During the checkpoint phase, the local states of the computation spread across thousands of processors are saved to stable storage. Unfortunately, this approach results in heavy I/O load and can cause an I/O bottleneck in a massively parallel system. In this paper, we examine application-level checkpointing for a massively parallel electromagnetic solver system called NekCEM on the IBM Blue Gene/P at Argonne National Laboratory. We discuss an application-level, two-phase I/O approach called "reduced-blocking I/O" (rbIO) and a tuned MPI-IO collective approach (coIO), and we demonstrate their performance advantage over the "1 POSIX file per processor" approach. Our study shows that rbIO and coIO result in a 100x improvement over previous checkpointing approaches on up to 65,536 processors of the Blue Gene/P using GPFS. Our study also demonstrates a 25x production performance improvement for NekCEM. We show how to optimize parameter settings for these parallel I/O approaches and verify the results through I/O profiling. In particular, we examine the performance advantage of rbIO and demonstrate the potential benefits of this approach over the traditional MPI-IO routine, coIO.
Citations: 16
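The reduced-blocking, two-phase idea, gathering many ranks' checkpoint data onto a few aggregator ranks and letting only those ranks touch the file system, can be sketched with plain MPI. The aggregation factor, file names and data layout below are illustrative assumptions, not the NekCEM or rbIO configuration.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define AGG_FACTOR 32   /* illustrative: one writer per 32 ranks */

    void checkpoint(const double *local, int n_local, int step)
    {
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Split ranks into groups of AGG_FACTOR; group rank 0 is the writer. */
        MPI_Comm group;
        MPI_Comm_split(MPI_COMM_WORLD, rank / AGG_FACTOR, rank, &group);
        int grank, gsize;
        MPI_Comm_rank(group, &grank);
        MPI_Comm_size(group, &gsize);

        double *buf = NULL;
        if (grank == 0)
            buf = malloc((size_t)gsize * n_local * sizeof(double));

        /* Phase 1: aggregate checkpoint data onto the group's writer rank. */
        MPI_Gather(local, n_local, MPI_DOUBLE,
                   buf, n_local, MPI_DOUBLE, 0, group);

        /* Phase 2: only the aggregators perform file I/O (one file per group),
           so far fewer clients hit the parallel file system. */
        if (grank == 0) {
            char name[64];
            snprintf(name, sizeof(name), "ckpt_%d_grp%d.dat",
                     step, rank / AGG_FACTOR);
            FILE *f = fopen(name, "wb");
            fwrite(buf, sizeof(double), (size_t)gsize * n_local, f);
            fclose(f);
            free(buf);
        }
        MPI_Comm_free(&group);
    }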
Optimized Non-contiguous MPI Datatype Communication for GPU Clusters: Design, Implementation and Evaluation with MVAPICH2
2011 IEEE International Conference on Cluster Computing. Pub Date: 2011-09-26. DOI: 10.1109/CLUSTER.2011.42
Hao Wang, S. Potluri, Miao Luo, A. Singh, Xiangyong Ouyang, S. Sur, D. Panda
Abstract: Data parallel architectures, such as General Purpose Graphics Processing Units (GPGPUs), have seen a tremendous rise in their application for High End Computing. However, data movement in and out of GPGPUs remains the biggest hurdle to overall performance and programmer productivity. Real scientific applications utilize multi-dimensional data, and data in higher dimensions may not be contiguous in memory. In order to improve programmer productivity and to enable communication libraries to optimize non-contiguous data communication, the MPI interface provides MPI datatypes. Currently, state-of-the-art MPI libraries do not provide native datatype support for data that resides in GPU memory. The management of non-contiguous GPU data is a source of productivity and performance loss, because GPU application developers have to manually move the data out of and into GPUs. In this paper, we present our design for enabling high-performance communication support between GPUs for non-contiguous datatypes. We describe our innovative approach to improve performance by "offloading" datatype packing and unpacking onto a GPU device, and "pipelining" all data transfer stages between two GPUs. Our design is integrated into the popular MVAPICH2 MPI library for InfiniBand, iWARP and RoCE clusters. We perform a detailed evaluation of our design on a GPU cluster with the latest NVIDIA Fermi GPU adapters. The evaluation reveals that the proposed designs can achieve up to 88% latency improvement for the vector datatype at 4 MB size with micro-benchmarks. For the Stencil2D application from the SHOC benchmark suite, our design can simplify the data communication in its main loop, reducing the lines of code by 36%. Further, our method can improve the performance of Stencil2D by up to 42% for the single precision data set, and 39% for the double precision data set. To the best of our knowledge, this is the first such design, implementation and evaluation of non-contiguous MPI data communication for GPU clusters.
Citations: 55
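What the datatype support buys the application can be shown with an ordinary MPI vector datatype describing a strided column; with a design like the one above, the same calls can be issued directly on GPU-resident buffers instead of staging through host copies. The sketch below uses host memory and hypothetical dimensions; it illustrates the datatype mechanics only, not the MVAPICH2-internal pack/unpack offload or pipelining.

    #include <mpi.h>

    /* Exchange one column of an N x N row-major matrix between ranks 0 and 1.
       The column is non-contiguous: N blocks of 1 double with stride N. */
    void exchange_column(double *matrix, int N, int col)
    {
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Datatype column;
        MPI_Type_vector(N, 1, N, MPI_DOUBLE, &column);
        MPI_Type_commit(&column);

        /* With CUDA-aware MPI (as in the paper's MVAPICH2 design), 'matrix'
           could be a device pointer and the pack/unpack would run on the GPU. */
        if (rank == 0)
            MPI_Send(&matrix[col], 1, column, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(&matrix[col], 1, column, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);

        MPI_Type_free(&column);
    }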