2011 IEEE International Conference on Cluster Computing — Latest Articles

Improving MapReduce Performance via Heterogeneity-Load-Aware Partition Function
2011 IEEE International Conference on Cluster Computing Pub Date : 2011-09-26 DOI: 10.1109/CLUSTER.2011.68
Huifeng Sun, Junliang Chen, Chuanchang Liu, Zibin Zheng, Nan Yu, Zhi Yang
Abstract: MapReduce is an important programming model for large-scale data-intensive applications such as web indexing, scientific simulation, and data mining. Hadoop is an open-source implementation of MapReduce that enjoys wide adoption. The partition function is an important component of Hadoop: it splits the outputs of the maps into buckets that form the input data of the reduces. Based on the assumption that cluster nodes are homogeneous and perform work at roughly the same rate, the default partition function splits intermediate keys evenly across the reduces. In practice, however, the homogeneity assumption seldom holds, and cluster nodes usually perform work at different rates. In this paper, we design a heterogeneity-load-aware partition function named the proportional partition function (PPF). Besides the dynamic load of cluster nodes, PPF considers the capacity diversity of cluster nodes, such as CPU processing speed and disk writing speed.
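The abstract does not give PPF's exact formula; a minimal sketch of the idea — giving each reduce a share of the key space proportional to a capacity/load score instead of Hadoop's uniform `hash(key) % numReduces` — might look like this (the MD5 key hashing and the score values are illustrative assumptions, not the paper's method):

```python
from bisect import bisect_right
import hashlib

def build_cutoffs(node_scores):
    """Turn per-node capacity/load scores into cumulative shares in [0, 1)."""
    total = sum(node_scores)
    cutoffs, acc = [], 0.0
    for s in node_scores[:-1]:
        acc += s / total
        cutoffs.append(acc)
    return cutoffs  # len(node_scores) - 1 cut points

def proportional_partition(key, cutoffs):
    """Map a key to a reduce slot in proportion to node scores."""
    # Hash the key to a point in [0, 1), then find its bucket.
    h = int(hashlib.md5(str(key).encode()).hexdigest(), 16)
    u = (h % 10**8) / 10**8
    return bisect_right(cutoffs, u)

# Example: a node with twice the score receives roughly twice the keys.
cutoffs = build_cutoffs([2.0, 1.0, 1.0])  # shares 0.5, 0.25, 0.25
counts = [0, 0, 0]
for k in range(10000):
    counts[proportional_partition(k, cutoffs)] += 1
```

The default Hadoop behavior corresponds to equal scores; heterogeneity awareness enters purely through the score vector, which PPF derives from node capacity and dynamic load.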
Citations: 1
Asynchronous Collective Output with Non-dedicated Cores
2011 IEEE International Conference on Cluster Computing Pub Date : 2011-09-26 DOI: 10.1109/CLUSTER.2011.82
P. Miller, Shen Li, Chao Mei
Abstract: Parallel applications are evolving to place larger demands not just on computation and network capabilities, but on storage systems as well. Storage hardware has scaled to keep up, but the software that drives it must evolve alongside to realize this increased potential. This paper presents an output forwarding middleware for message-driven parallel applications written in Charm++. This layer directs IO operations across the entire system to a designated subset of processors in order to minimize contention and overheads. Our implementation is distinctive in that these processors are not dedicated to this task but can still contribute to the computational work. Other processors need not block while waiting for the designated IO processors to become ready or make progress. Using this new layer, we demonstrate speedups of 1.5-2.5× in the popular scientific code NAMD over its previous parallel output implementation, along with reduced sensitivity to IO subsystem parameters.
Citations: 2
Multicore/GPGPU Portable Computational Kernels via Multidimensional Arrays
2011 IEEE International Conference on Cluster Computing Pub Date : 2011-09-26 DOI: 10.1109/CLUSTER.2011.47
H. C. Edwards, Daniel Sunderland, Chris Amsler, Sam P. Mish
Abstract: Large, complex scientific and engineering application codes have a significant investment in the computational kernels that implement their mathematical models. Porting these computational kernels to the collection of modern manycore accelerator devices is a major challenge, in that these devices have diverse programming models, application programming interfaces (APIs), and performance requirements. The Trilinos-Kokkos array programming model provides a library-based approach to implement computational kernels that are performance-portable to CPU-multicore and GPGPU accelerator devices. This programming model is based upon three fundamental concepts: (1) there exist one or more manycore compute devices, each with its own memory space; (2) data-parallel kernels are executed via parallel-for and parallel-reduce operations; and (3) kernels operate on multidimensional arrays. Kernel execution performance is, especially for NVIDIA GPGPU devices, extremely dependent on data access patterns. An optimal data access pattern can be different for different manycore devices, potentially leading to different implementations of computational kernels specialized for different devices. The Trilinos-Kokkos programming model supports performance-portable kernels by separating data access patterns from computational kernels through a multidimensional array API. Through this API, device-specific mappings of multi-indices to device memory are introduced into a computational kernel through compile-time polymorphism, i.e., without modification of the kernel.
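The separation of index mapping from kernel code can be sketched in a few lines (class and method names here are illustrative, not the actual Trilinos-Kokkos C++ API, which selects the layout at compile time rather than at runtime):

```python
# The kernel indexes (i, j) abstractly; the layout decides memory order.

class LayoutRight:            # row-major: adjacent j -- good for CPU caches
    @staticmethod
    def offset(i, j, ni, nj):
        return i * nj + j

class LayoutLeft:             # column-major: adjacent i -- coalesced on GPUs
    @staticmethod
    def offset(i, j, ni, nj):
        return j * ni + i

class View2D:
    """A 2D array whose memory layout is a pluggable policy."""
    def __init__(self, ni, nj, layout):
        self.ni, self.nj, self.layout = ni, nj, layout
        self.data = [0.0] * (ni * nj)
    def __getitem__(self, ij):
        i, j = ij
        return self.data[self.layout.offset(i, j, self.ni, self.nj)]
    def __setitem__(self, ij, v):
        i, j = ij
        self.data[self.layout.offset(i, j, self.ni, self.nj)] = v

def axpy_kernel(a, x, y):
    """The kernel never mentions the layout -- it is device-portable."""
    for i in range(x.ni):
        for j in range(x.nj):
            y[i, j] += a * x[i, j]

x = View2D(2, 3, LayoutRight)
y = View2D(2, 3, LayoutLeft)
x[1, 2] = 4.0
axpy_kernel(0.5, x, y)
```

Swapping the layout changes where each element lives in memory without touching `axpy_kernel`, which is the portability property the abstract describes.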
Citations: 4
Evaluating Performance Impacts of Delayed Failure Repairing on Large-Scale Systems
2011 IEEE International Conference on Cluster Computing Pub Date : 2011-09-26 DOI: 10.1109/CLUSTER.2011.71
Zhou Zhou, Wei Tang, Ziming Zheng, Z. Lan, N. Desai
Abstract: With the fast improvement in technology, we are now moving toward exascale computing. Many experts predict that exascale computers will have millions of nodes, billions of threads of execution, hundreds of petabytes of memory, and exabytes of persistent storage. For systems of such a scale, frequent failures are becoming a serious concern. One of the most important reasons is that in a large-scale system it is hard to detect failures; as a result, failure repair may take substantial time. In this paper, we investigate the effect of delayed repair on two popular types of high-performance computing systems: IBM Blue Gene/P and general clusters. We analyze how delayed failure repair affects the performance of jobs when some computing units are at fault but not fixed in time. Our study is based on real workload traces and RAS logs collected from production supercomputing systems. Our trace-based simulations indicate that fast failure detection and recovery is essential for moving toward petascale and beyond.
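The core question — how much capacity is lost when a failed node sits unrepaired — can be posed as a toy trace-driven simulation (the failure trace, parameters, and the node-hours metric below are fabricated for illustration; the paper simulates real workload traces and RAS logs, including job-level effects):

```python
def simulate(failure_times, repair_delay, horizon, nodes):
    """Fraction of total node-hours lost to unrepaired failures.
    Each failure takes one node down until repair (or end of trace)."""
    lost = 0.0
    for t in failure_times:
        down_until = min(t + repair_delay, horizon)
        lost += down_until - t
    return lost / (nodes * horizon)

failures = [10, 40, 45, 90]                  # fabricated failure trace
fast = simulate(failures, repair_delay=1, horizon=100, nodes=8)
slow = simulate(failures, repair_delay=20, horizon=100, nodes=8)
```

Even this toy model shows the qualitative effect the paper measures: capacity loss grows with repair delay, and the real impact on jobs is larger still because running jobs on the failed node are interrupted.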
Citations: 2
Accelerating Galois Field Arithmetic for Reed-Solomon Erasure Codes in Storage Applications
2011 IEEE International Conference on Cluster Computing Pub Date : 2011-09-26 DOI: 10.1109/CLUSTER.2011.40
S. Kalcher, V. Lindenstruth
Abstract: Galois fields (also called finite fields) play an essential role in the areas of cryptography and coding theory. They are the foundation of various error- and erasure-correcting codes and are therefore central to the design of reliable storage systems. The efficiency and performance of these systems depend considerably on the implementation of Galois field arithmetic, in particular on the implementation of multiplication. In current software implementations, multiplication is typically performed using pre-calculated lookup tables for the logarithm and its inverse, or even for the full multiplication result. However, today the memory subsystem has become one of the main bottlenecks in commodity systems, and relying on large in-memory data structures accessed from inner-loop code can severely impact overall performance and deteriorate scalability. In this paper, we study the execution of Galois field multiplication on modern processor architectures without using lookup tables. Instead, we propose to trade computation for memory references and, therefore, to perform full polynomial multiplication with modular reduction using the generator polynomial of the Galois field. We present a SIMDized (vectorized) implementation of the polynomial multiplication algorithm in GF(2^16) and show its performance on commodity processors and on GPGPU accelerators.
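Table-free Galois-field multiplication — carry-less polynomial multiplication followed by reduction with the field's generator polynomial — can be sketched scalar-style as follows (a per-element model of what the paper's SIMD code vectorizes; `0x1100B` is a commonly used primitive polynomial for GF(2^16), assumed here rather than taken from the paper):

```python
PRIM_POLY_16 = 0x1100B  # x^16 + x^12 + x^3 + x + 1 (assumed generator)

def gf216_mul(a, b):
    """Multiply in GF(2^16) without lookup tables."""
    # Carry-less polynomial multiply over GF(2): shift-and-XOR,
    # since addition of coefficients is XOR, not integer add.
    prod = 0
    while b:
        if b & 1:
            prod ^= a
        a <<= 1
        b >>= 1
    # Reduce the up-to-31-bit product modulo the generator polynomial,
    # clearing each high bit from degree 30 down to degree 16.
    for bit in range(30, 15, -1):
        if prod & (1 << bit):
            prod ^= PRIM_POLY_16 << (bit - 16)
    return prod
```

The two loops touch no memory beyond the operands, which is exactly the trade — more ALU work, no inner-loop table lookups — that the paper evaluates on SIMD units and GPGPUs.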
Citations: 14
Datamation: A Quarter of a Century and Four Orders of Magnitude Later
2011 IEEE International Conference on Cluster Computing Pub Date : 2011-09-26 DOI: 10.1109/CLUSTER.2011.75
P. Bertasi, M. Bonazza, M. Bressan, E. Peserico
Abstract: The combination of the high-performance psort sorting library and a carefully tuned desktop-class cluster allowed us to improve the previous record on the Datamation sort benchmark by over an order of magnitude, sorting a million 100-byte records from disk to disk in a few dozen milliseconds. Of the many implementation and configuration choices we faced, the most crucial were judicious data placement and access patterns on disk, adoption of UDP sockets instead of MPI, careful pruning of virtually all system daemons, and rejection of "on demand" frequency scaling.
Citations: 0
Automatic Computer System Characterization for a Parallelizing Compiler
2011 IEEE International Conference on Cluster Computing Pub Date : 2011-09-26 DOI: 10.1109/CLUSTER.2011.32
A. Sussman, N. Lo, T. Anderson
Abstract: Effectively utilizing the compute power of modern multi-core machines is a challenging task for a programmer. Automated extraction of shared-memory parallelism via powerful compiler transformations and optimizations is one means to that goal. However, the effectiveness of such transformations is tied to detailed characteristics of the target computer system. In this paper, we describe an automated system for capturing such computer system characteristics, based on prior work on various parts of the overall problem. The system characteristics measured include the number of compute elements available to run threads, multiple memory-hierarchy parameters, and functional-unit latencies and bandwidths. We show experimental results on a wide range of compute platforms that validate the effectiveness of the overall approach.
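One classic technique behind such memory-hierarchy measurements is pointer chasing: time dependent loads over random working sets of growing size and look for latency jumps at cache boundaries. A sketch of just that one probe (in Python the interpreter overhead largely masks the cache effect, so this shows the technique, not production-quality measurement):

```python
import random
import time

def chase_latency(n_bytes, iters=200_000):
    """Average time per dependent load while chasing a random cyclic
    permutation occupying roughly n_bytes (8 bytes/slot assumed)."""
    n = max(2, n_bytes // 8)
    perm = list(range(n))
    random.shuffle(perm)
    nxt = [0] * n
    for i in range(n):          # link all slots into one random cycle
        nxt[perm[i]] = perm[(i + 1) % n]
    j = 0
    t0 = time.perf_counter()
    for _ in range(iters):
        j = nxt[j]              # each load depends on the previous one
    return (time.perf_counter() - t0) / iters

# Probe working sets around typical L1 / L2 / beyond-LLC sizes.
latencies = {kb: chase_latency(kb * 1024) for kb in (16, 256, 4096)}
```

A characterization tool written in C would run this same access pattern and report the working-set sizes at which per-load latency steps upward as the cache-level capacities.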
Citations: 4
Evolutionary Scheduling of Parallel Tasks Graphs onto Homogeneous Clusters
2011 IEEE International Conference on Cluster Computing Pub Date : 2011-09-26 DOI: 10.1109/CLUSTER.2011.45
S. Hunold, Joachim Lepping
Abstract: Parallel task graphs (PTGs) arise when parallel programs are combined into larger applications, e.g., scientific workflows. Scheduling these PTGs onto clusters is a challenging problem due to the additional degree of parallelism stemming from moldable tasks. Most algorithms are based on the assumption that the execution time of a parallel task decreases monotonically as the number of processors increases. But this assumption does not hold in practice, since parallel programs often perform better if the number of processors is a multiple of internally used block sizes. In this article, we introduce the Evolutionary Moldable Task Scheduling (EMTS) algorithm for scheduling static PTGs onto homogeneous clusters. We apply an evolutionary approach to determine the processor allocation of each task. The evolutionary strategy ensures that EMTS can be used with any underlying model for predicting the execution time of moldable tasks. With the purpose of finding solutions quickly, EMTS considers the results of other heuristics (e.g., HCPA, MCPA) as starting solutions. The experimental results show that EMTS significantly reduces the makespan of PTGs compared to other heuristics, for both non-monotonically and monotonically decreasing models.
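The evolutionary search over allocations can be sketched with a deliberately tiny model (independent tasks, a made-up block-size-sensitive runtime model, and a surrogate makespan bound — all illustrative assumptions, not EMTS's actual operators or task-graph scheduler):

```python
import random

random.seed(1)
P = 16                               # processors in the hypothetical cluster
works = [120.0, 80.0, 200.0, 60.0]   # per-task sequential work (made up)

def runtime(w, p):
    """Non-monotonic model: full efficiency only when the allocation
    is a multiple of an internal block size of 4 (illustrative)."""
    eff = p if p % 4 == 0 else 0.7 * p
    return w / eff

def fitness(alloc):
    """Surrogate makespan: max of the longest task and the area bound."""
    times = [runtime(w, p) for w, p in zip(works, alloc)]
    area = sum(t * p for t, p in zip(times, alloc)) / P
    return max(max(times), area)

def mutate(alloc):
    """Nudge one task's processor count, clamped to [1, P]."""
    child = alloc[:]
    i = random.randrange(len(child))
    child[i] = max(1, min(P, child[i] + random.choice([-2, -1, 1, 2])))
    return child

# Seed the population with a naive equal-share heuristic (EMTS seeds
# with HCPA/MCPA solutions), then evolve by mutation + truncation selection.
pop = [[P // len(works)] * len(works) for _ in range(20)]
for _ in range(200):
    pop += [mutate(random.choice(pop)) for _ in range(20)]
    pop = sorted(pop, key=fitness)[:20]
best = pop[0]
```

Because the search only ever calls `fitness`, any runtime-prediction model — monotonic or not — can be plugged in, which is the model-agnosticism the abstract claims for EMTS.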
Citations: 8
Process Distance-Aware Adaptive MPI Collective Communications
2011 IEEE International Conference on Cluster Computing Pub Date : 2011-09-26 DOI: 10.1109/CLUSTER.2011.30
Teng Ma, T. Hérault, G. Bosilca, J. Dongarra
Abstract: Message Passing Interface (MPI) implementations provide great flexibility, allowing users to arbitrarily bind processes to computing cores to fully exploit clusters of multicore/manycore nodes. An intelligent process placement can optimize application performance according to the underlying hardware architecture and the application's communication pattern. However, such static process-placement optimization cannot help MPI collective communication, whose topology changes dynamically with the members of each communicator. As a result, a mismatch between the collective communication topology, the underlying hardware architecture, and the process placement often occurs, due to MPI's limited capabilities for dealing with complex environments. This paper proposes an adaptive collective communication framework that combines process distance, the underlying hardware topology, and the runtime communicator. Based on this information, an optimal communication topology is generated to guarantee maximum bandwidth for each MPI collective operation regardless of process placement. Within this framework, two distance-aware adaptive intra-node collective operations (Broadcast and Allgather) are implemented as examples inside Open MPI's KNEM collective component. The awareness of process distance helps these two operations construct optimal runtime topologies and balance memory accesses across memory nodes. The experiments show that these two distance-aware collective operations provide better and more stable performance than the current collectives in Open MPI, regardless of process placement.
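The distance-aware idea — pay the expensive hop across a locality boundary only once, then fan out cheaply inside it — can be sketched as a broadcast plan (the rank-to-node placement and leader-selection rule are invented for illustration; the paper's implementation lives inside Open MPI and reasons about intra-node memory distances, not just node boundaries):

```python
def distance_aware_bcast_plan(root, placement):
    """placement maps rank -> node id. Returns (sender, receiver) pairs:
    one inter-node hop per remote node, then intra-node fan-out."""
    plan = []
    nodes = {}
    for rank, node in placement.items():
        nodes.setdefault(node, []).append(rank)
    root_node = placement[root]
    for node, ranks in nodes.items():
        leader = root if node == root_node else min(ranks)
        if node != root_node:
            plan.append((root, leader))          # one cross-boundary message
        for r in ranks:
            if r not in (leader, root):
                plan.append((leader, r))         # cheap local messages
    return plan

placement = {0: "n0", 1: "n0", 2: "n1", 3: "n1", 4: "n1"}
plan = distance_aware_bcast_plan(0, placement)
cross = sum(1 for s, r in plan if placement[s] != placement[r])
```

A placement-oblivious flat broadcast from rank 0 would cross the node boundary three times here; the distance-aware plan crosses it once, which is the bandwidth argument the paper makes.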
Citations: 31
Automatically Selecting the Number of Aggregators for Collective I/O Operations
2011 IEEE International Conference on Cluster Computing Pub Date : 2011-09-26 DOI: 10.1109/CLUSTER.2011.79
M. Chaarawi, E. Gabriel
Abstract: Optimizing collective I/O operations is of paramount importance for many data-intensive high-performance computing applications. Despite the large number of algorithms published in the field, most current approaches focus on tuning every single application scenario separately and do not offer a consistent and automatic method of identifying internal parameters for collective I/O algorithms. Most notably, published work exists to optimize the number of processes actually touching a file, the so-called aggregators. This paper introduces a novel runtime approach for determining the number of aggregator processes to be used in a collective I/O operation, depending on the file view, the process topology, the per-process write saturation point, and the actual amount of data written in a collective write operation. The algorithm is evaluated on two different file systems with multiple benchmarks. In more than 80% of the test cases, our algorithm delivered performance close to the best obtained by hand-tuning the number of aggregators for each scenario.
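A simplistic reading of two of the paper's inputs — total data volume and the per-process write saturation point — yields a back-of-the-envelope aggregator count (this formula is a plausible sketch, not the published algorithm, which also accounts for the file view and process topology):

```python
def choose_aggregators(total_bytes, num_procs, saturation_bytes):
    """Fewest aggregators such that no aggregator writes more than the
    per-process saturation point, capped at the number of processes."""
    needed = -(-total_bytes // saturation_bytes)   # ceiling division
    return max(1, min(num_procs, needed))

# 1 GiB collective write, 64 processes, 32 MiB saturation point per process:
n = choose_aggregators(1 << 30, 64, 32 << 20)
```

The intuition: below the saturation point, adding aggregators adds bandwidth; beyond it, extra aggregators only add coordination overhead, so the count should track the write volume at runtime rather than being a fixed hint.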
Citations: 32