2011 IEEE International Conference on Cluster Computing — Latest Articles

Improving MapReduce Performance via Heterogeneity-Load-Aware Partition Function
2011 IEEE International Conference on Cluster Computing Pub Date : 2011-09-26 DOI: 10.1109/CLUSTER.2011.68
Huifeng Sun, Junliang Chen, Chuanchang Liu, Zibin Zheng, Nan Yu, Zhi Yang
Abstract: MapReduce is an important programming model for large-scale data-intensive applications such as web indexing, scientific simulation, and data mining. Hadoop is an open-source implementation of MapReduce that enjoys wide adoption. The partition function is an important component of Hadoop: it splits the outputs of the maps into buckets that form the input data of the reduces. Based on the assumption that cluster nodes are homogeneous and perform work at roughly the same rate, the default partition function splits intermediate keys evenly across the reduces. In practice, however, the homogeneity assumption seldom holds, and cluster nodes usually perform work at different rates. In this paper, we design a heterogeneity-load-aware partition function named the proportional partition function (PPF). Besides the dynamic load of cluster nodes, PPF considers the capacity diversity of cluster nodes, such as CPU processing speed and disk writing speed.
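The abstract does not give PPF's exact formula; a minimal sketch of the idea — giving each reduce a share of the key space proportional to a capacity/load score instead of Hadoop's uniform `hash(key) % numReduces` — might look like this (the MD5 key hashing and the score values are illustrative assumptions, not the paper's method):

```python
from bisect import bisect_right
import hashlib

def build_cutoffs(node_scores):
    """Turn per-node capacity/load scores into cumulative shares in [0, 1)."""
    total = sum(node_scores)
    cutoffs, acc = [], 0.0
    for s in node_scores[:-1]:
        acc += s / total
        cutoffs.append(acc)
    return cutoffs  # len(node_scores) - 1 cut points

def proportional_partition(key, cutoffs):
    """Map a key to a reduce slot in proportion to node scores."""
    # Hash the key to a point in [0, 1), then find its bucket.
    h = int(hashlib.md5(str(key).encode()).hexdigest(), 16)
    u = (h % 10**8) / 10**8
    return bisect_right(cutoffs, u)

# Example: a node with twice the score receives roughly twice the keys.
cutoffs = build_cutoffs([2.0, 1.0, 1.0])  # shares 0.5, 0.25, 0.25
counts = [0, 0, 0]
for k in range(10000):
    counts[proportional_partition(k, cutoffs)] += 1
```

The default Hadoop behavior corresponds to equal scores; heterogeneity awareness enters purely through the score vector, which PPF derives from node capacity and dynamic load.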
Citations: 1
Asynchronous Collective Output with Non-dedicated Cores
2011 IEEE International Conference on Cluster Computing Pub Date : 2011-09-26 DOI: 10.1109/CLUSTER.2011.82
P. Miller, Shen Li, Chao Mei
Abstract: Parallel applications are evolving to place larger demands not just on computation and network capabilities, but on storage systems as well. Storage hardware has scaled to keep up, but the software that drives it must evolve alongside to realize this increased potential. This paper presents an output forwarding middleware for message-driven parallel applications written in Charm++. This layer directs IO operations across the entire system to a designated subset of processors in order to minimize contention and overheads. Our implementation is distinctive in that these processors are not dedicated to this task but can still contribute to the computational work. Other processors need not block while waiting for the designated IO processors to become ready or make progress. Using this new layer, we demonstrate speedups of 1.5-2.5× in the popular scientific code NAMD over its previous parallel output implementation, along with reduced sensitivity to IO subsystem parameters.
Citations: 2
Multicore/GPGPU Portable Computational Kernels via Multidimensional Arrays
2011 IEEE International Conference on Cluster Computing Pub Date : 2011-09-26 DOI: 10.1109/CLUSTER.2011.47
H. C. Edwards, Daniel Sunderland, Chris Amsler, Sam P. Mish
Abstract: Large, complex scientific and engineering application codes have a significant investment in the computational kernels that implement their mathematical models. Porting these computational kernels to the collection of modern manycore accelerator devices is a major challenge, in that these devices have diverse programming models, application programming interfaces (APIs), and performance requirements. The Trilinos-Kokkos array programming model provides a library-based approach to implement computational kernels that are performance-portable to CPU-multicore and GPGPU accelerator devices. This programming model is based upon three fundamental concepts: (1) there exist one or more manycore compute devices, each with its own memory space; (2) data-parallel kernels are executed via parallel-for and parallel-reduce operations; and (3) kernels operate on multidimensional arrays. Kernel execution performance is, especially for NVIDIA GPGPU devices, extremely dependent on data access patterns. An optimal data access pattern can be different for different manycore devices, potentially leading to different implementations of computational kernels specialized for different devices. The Trilinos-Kokkos programming model supports performance-portable kernels by separating data access patterns from computational kernels through a multidimensional array API. Through this API, device-specific mappings of multi-indices to device memory are introduced into a computational kernel through compile-time polymorphism, i.e., without modification of the kernel.
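The separation of index mapping from kernel code can be sketched in a few lines (class and method names here are illustrative, not the actual Trilinos-Kokkos C++ API, which selects the layout at compile time rather than at runtime):

```python
# The kernel indexes (i, j) abstractly; the layout decides memory order.

class LayoutRight:            # row-major: adjacent j -- good for CPU caches
    @staticmethod
    def offset(i, j, ni, nj):
        return i * nj + j

class LayoutLeft:             # column-major: adjacent i -- coalesced on GPUs
    @staticmethod
    def offset(i, j, ni, nj):
        return j * ni + i

class View2D:
    """A 2D array whose memory layout is a pluggable policy."""
    def __init__(self, ni, nj, layout):
        self.ni, self.nj, self.layout = ni, nj, layout
        self.data = [0.0] * (ni * nj)
    def __getitem__(self, ij):
        i, j = ij
        return self.data[self.layout.offset(i, j, self.ni, self.nj)]
    def __setitem__(self, ij, v):
        i, j = ij
        self.data[self.layout.offset(i, j, self.ni, self.nj)] = v

def axpy_kernel(a, x, y):
    """The kernel never mentions the layout -- it is device-portable."""
    for i in range(x.ni):
        for j in range(x.nj):
            y[i, j] += a * x[i, j]

x = View2D(2, 3, LayoutRight)
y = View2D(2, 3, LayoutLeft)
x[1, 2] = 4.0
axpy_kernel(0.5, x, y)
```

Swapping the layout changes where each element lives in memory without touching `axpy_kernel`, which is the portability property the abstract describes.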
Citations: 4
Evaluating Performance Impacts of Delayed Failure Repairing on Large-Scale Systems
2011 IEEE International Conference on Cluster Computing Pub Date : 2011-09-26 DOI: 10.1109/CLUSTER.2011.71
Zhou Zhou, Wei Tang, Ziming Zheng, Z. Lan, N. Desai
Abstract: With the fast improvement in technology, we are now moving toward exascale computing. Many experts predict that exascale computers will have millions of nodes, billions of threads of execution, hundreds of petabytes of memory, and exabytes of persistent storage. For systems of such a scale, frequent failures are becoming a serious concern. One of the most important reasons is that in a large-scale system it is hard to detect failures; as a result, failure repair may take substantial time. In this paper, we investigate the effect of delayed repair on two popular types of high-performance computing systems: IBM Blue Gene/P and general clusters. We analyze how delayed failure repair affects the performance of jobs when some computing units are at fault but not fixed in time. Our study is based on real workload traces and RAS logs collected from production supercomputing systems. Our trace-based simulations indicate that fast failure detection and recovery is essential for moving toward petascale and beyond.
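The core question — how much capacity is lost when a failed node sits unrepaired — can be posed as a toy trace-driven simulation (the failure trace, parameters, and the node-hours metric below are fabricated for illustration; the paper simulates real workload traces and RAS logs, including job-level effects):

```python
def simulate(failure_times, repair_delay, horizon, nodes):
    """Fraction of total node-hours lost to unrepaired failures.
    Each failure takes one node down until repair (or end of trace)."""
    lost = 0.0
    for t in failure_times:
        down_until = min(t + repair_delay, horizon)
        lost += down_until - t
    return lost / (nodes * horizon)

failures = [10, 40, 45, 90]                  # fabricated failure trace
fast = simulate(failures, repair_delay=1, horizon=100, nodes=8)
slow = simulate(failures, repair_delay=20, horizon=100, nodes=8)
```

Even this toy model shows the qualitative effect the paper measures: capacity loss grows with repair delay, and the real impact on jobs is larger still because running jobs on the failed node are interrupted.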
Citations: 2
Accelerating Galois Field Arithmetic for Reed-Solomon Erasure Codes in Storage Applications
2011 IEEE International Conference on Cluster Computing Pub Date : 2011-09-26 DOI: 10.1109/CLUSTER.2011.40
S. Kalcher, V. Lindenstruth
Abstract: Galois fields (also called finite fields) play an essential role in the areas of cryptography and coding theory. They are the foundation of various error- and erasure-correcting codes and are therefore central to the design of reliable storage systems. The efficiency and performance of these systems depend considerably on the implementation of Galois field arithmetic, in particular on the implementation of multiplication. In current software implementations, multiplication is typically performed using pre-calculated lookup tables for the logarithm and its inverse, or even for the full multiplication result. However, today the memory subsystem has become one of the main bottlenecks in commodity systems, and relying on large in-memory data structures accessed from inner-loop code can severely impact overall performance and deteriorate scalability. In this paper, we study the execution of Galois field multiplication on modern processor architectures without using lookup tables. Instead, we propose to trade computation for memory references and, therefore, to perform full polynomial multiplication with modular reduction using the generator polynomial of the Galois field. We present a SIMDized (vectorized) implementation of the polynomial multiplication algorithm in GF(2^16) and show its performance on commodity processors and on GPGPU accelerators.
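Table-free Galois-field multiplication — carry-less polynomial multiplication followed by reduction with the field's generator polynomial — can be sketched scalar-style as follows (a per-element model of what the paper's SIMD code vectorizes; `0x1100B` is a commonly used primitive polynomial for GF(2^16), assumed here rather than taken from the paper):

```python
PRIM_POLY_16 = 0x1100B  # x^16 + x^12 + x^3 + x + 1 (assumed generator)

def gf216_mul(a, b):
    """Multiply in GF(2^16) without lookup tables."""
    # Carry-less polynomial multiply over GF(2): shift-and-XOR,
    # since addition of coefficients is XOR, not integer add.
    prod = 0
    while b:
        if b & 1:
            prod ^= a
        a <<= 1
        b >>= 1
    # Reduce the up-to-31-bit product modulo the generator polynomial,
    # clearing each high bit from degree 30 down to degree 16.
    for bit in range(30, 15, -1):
        if prod & (1 << bit):
            prod ^= PRIM_POLY_16 << (bit - 16)
    return prod
```

The two loops touch no memory beyond the operands, which is exactly the trade — more ALU work, no inner-loop table lookups — that the paper evaluates on SIMD units and GPGPUs.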
Citations: 14
Datamation: A Quarter of a Century and Four Orders of Magnitude Later
2011 IEEE International Conference on Cluster Computing Pub Date : 2011-09-26 DOI: 10.1109/CLUSTER.2011.75
P. Bertasi, M. Bonazza, M. Bressan, E. Peserico
Abstract: The combination of the high-performance psort sorting library and a carefully tuned desktop-class cluster allowed us to improve the previous record on the Datamation sort benchmark by over an order of magnitude, sorting a million 100-byte records from disk to disk in a few dozen milliseconds. Of the many implementation and configuration choices we faced, the most crucial were judicious data placement and access patterns on disk, adoption of UDP sockets instead of MPI, careful pruning of virtually all system daemons, and rejection of "on demand" frequency scaling.
Citations: 0
Automatic Computer System Characterization for a Parallelizing Compiler
2011 IEEE International Conference on Cluster Computing Pub Date : 2011-09-26 DOI: 10.1109/CLUSTER.2011.32
A. Sussman, N. Lo, T. Anderson
Abstract: Effectively utilizing the compute power of modern multi-core machines is a challenging task for a programmer. Automated extraction of shared-memory parallelism via powerful compiler transformations and optimizations is one means to that goal. However, the effectiveness of such transformations is tied to detailed characteristics of the target computer system. In this paper, we describe an automated system for capturing such computer system characteristics, based on prior work on various parts of the overall problem. The system characteristics measured include the number of compute elements available to run threads, multiple memory-hierarchy parameters, and functional-unit latencies and bandwidths. We show experimental results on a wide range of compute platforms that validate the effectiveness of the overall approach.
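One classic technique behind such memory-hierarchy measurements is pointer chasing: time dependent loads over random working sets of growing size and look for latency jumps at cache boundaries. A sketch of just that one probe (in Python the interpreter overhead largely masks the cache effect, so this shows the technique, not production-quality measurement):

```python
import random
import time

def chase_latency(n_bytes, iters=200_000):
    """Average time per dependent load while chasing a random cyclic
    permutation occupying roughly n_bytes (8 bytes/slot assumed)."""
    n = max(2, n_bytes // 8)
    perm = list(range(n))
    random.shuffle(perm)
    nxt = [0] * n
    for i in range(n):          # link all slots into one random cycle
        nxt[perm[i]] = perm[(i + 1) % n]
    j = 0
    t0 = time.perf_counter()
    for _ in range(iters):
        j = nxt[j]              # each load depends on the previous one
    return (time.perf_counter() - t0) / iters

# Probe working sets around typical L1 / L2 / beyond-LLC sizes.
latencies = {kb: chase_latency(kb * 1024) for kb in (16, 256, 4096)}
```

A characterization tool written in C would run this same access pattern and report the working-set sizes at which per-load latency steps upward as the cache-level capacities.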
Citations: 4
Evolutionary Scheduling of Parallel Tasks Graphs onto Homogeneous Clusters
2011 IEEE International Conference on Cluster Computing Pub Date : 2011-09-26 DOI: 10.1109/CLUSTER.2011.45
S. Hunold, Joachim Lepping
Abstract: Parallel task graphs (PTGs) arise when parallel programs are combined into larger applications, e.g., scientific workflows. Scheduling these PTGs onto clusters is a challenging problem due to the additional degree of parallelism stemming from moldable tasks. Most algorithms are based on the assumption that the execution time of a parallel task decreases monotonically as the number of processors increases. But this assumption does not hold in practice, since parallel programs often perform better if the number of processors is a multiple of internally used block sizes. In this article, we introduce the Evolutionary Moldable Task Scheduling (EMTS) algorithm for scheduling static PTGs onto homogeneous clusters. We apply an evolutionary approach to determine the processor allocation of each task. The evolutionary strategy ensures that EMTS can be used with any underlying model for predicting the execution time of moldable tasks. With the purpose of finding solutions quickly, EMTS considers the results of other heuristics (e.g., HCPA, MCPA) as starting solutions. The experimental results show that EMTS significantly reduces the makespan of PTGs compared to other heuristics, for both non-monotonically and monotonically decreasing models.
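The evolutionary search over allocations can be sketched with a deliberately tiny model (independent tasks, a made-up block-size-sensitive runtime model, and a surrogate makespan bound — all illustrative assumptions, not EMTS's actual operators or task-graph scheduler):

```python
import random

random.seed(1)
P = 16                               # processors in the hypothetical cluster
works = [120.0, 80.0, 200.0, 60.0]   # per-task sequential work (made up)

def runtime(w, p):
    """Non-monotonic model: full efficiency only when the allocation
    is a multiple of an internal block size of 4 (illustrative)."""
    eff = p if p % 4 == 0 else 0.7 * p
    return w / eff

def fitness(alloc):
    """Surrogate makespan: max of the longest task and the area bound."""
    times = [runtime(w, p) for w, p in zip(works, alloc)]
    area = sum(t * p for t, p in zip(times, alloc)) / P
    return max(max(times), area)

def mutate(alloc):
    """Nudge one task's processor count, clamped to [1, P]."""
    child = alloc[:]
    i = random.randrange(len(child))
    child[i] = max(1, min(P, child[i] + random.choice([-2, -1, 1, 2])))
    return child

# Seed the population with a naive equal-share heuristic (EMTS seeds
# with HCPA/MCPA solutions), then evolve by mutation + truncation selection.
pop = [[P // len(works)] * len(works) for _ in range(20)]
for _ in range(200):
    pop += [mutate(random.choice(pop)) for _ in range(20)]
    pop = sorted(pop, key=fitness)[:20]
best = pop[0]
```

Because the search only ever calls `fitness`, any runtime-prediction model — monotonic or not — can be plugged in, which is the model-agnosticism the abstract claims for EMTS.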
Citations: 8
Process Distance-Aware Adaptive MPI Collective Communications
2011 IEEE International Conference on Cluster Computing Pub Date : 2011-09-26 DOI: 10.1109/CLUSTER.2011.30
Teng Ma, T. Hérault, G. Bosilca, J. Dongarra
Abstract: Message Passing Interface (MPI) implementations provide great flexibility, allowing users to arbitrarily bind processes to computing cores to fully exploit clusters of multicore/manycore nodes. An intelligent process placement can optimize application performance according to the underlying hardware architecture and the application's communication pattern. However, such static process-placement optimization cannot help MPI collective communication, whose topology changes dynamically with the members of each communicator. As a result, a mismatch between the collective communication topology, the underlying hardware architecture, and the process placement often occurs, due to MPI's limited capabilities for dealing with complex environments. This paper proposes an adaptive collective communication framework that combines process distance, the underlying hardware topology, and the runtime communicator. Based on this information, an optimal communication topology is generated to guarantee maximum bandwidth for each MPI collective operation regardless of process placement. Within this framework, two distance-aware adaptive intra-node collective operations (Broadcast and Allgather) are implemented as examples inside Open MPI's KNEM collective component. The awareness of process distance helps these two operations construct optimal runtime topologies and balance memory accesses across memory nodes. The experiments show that these two distance-aware collective operations provide better and more stable performance than the current collectives in Open MPI, regardless of process placement.
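The distance-aware idea — pay the expensive hop across a locality boundary only once, then fan out cheaply inside it — can be sketched as a broadcast plan (the rank-to-node placement and leader-selection rule are invented for illustration; the paper's implementation lives inside Open MPI and reasons about intra-node memory distances, not just node boundaries):

```python
def distance_aware_bcast_plan(root, placement):
    """placement maps rank -> node id. Returns (sender, receiver) pairs:
    one inter-node hop per remote node, then intra-node fan-out."""
    plan = []
    nodes = {}
    for rank, node in placement.items():
        nodes.setdefault(node, []).append(rank)
    root_node = placement[root]
    for node, ranks in nodes.items():
        leader = root if node == root_node else min(ranks)
        if node != root_node:
            plan.append((root, leader))          # one cross-boundary message
        for r in ranks:
            if r not in (leader, root):
                plan.append((leader, r))         # cheap local messages
    return plan

placement = {0: "n0", 1: "n0", 2: "n1", 3: "n1", 4: "n1"}
plan = distance_aware_bcast_plan(0, placement)
cross = sum(1 for s, r in plan if placement[s] != placement[r])
```

A placement-oblivious flat broadcast from rank 0 would cross the node boundary three times here; the distance-aware plan crosses it once, which is the bandwidth argument the paper makes.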
Citations: 31
Automatically Selecting the Number of Aggregators for Collective I/O Operations
2011 IEEE International Conference on Cluster Computing Pub Date : 2011-09-26 DOI: 10.1109/CLUSTER.2011.79
M. Chaarawi, E. Gabriel
Abstract: Optimizing collective I/O operations is of paramount importance for many data-intensive high-performance computing applications. Despite the large number of algorithms published in the field, most current approaches focus on tuning every single application scenario separately and do not offer a consistent and automatic method of identifying internal parameters for collective I/O algorithms. Most notably, published work exists to optimize the number of processes actually touching a file, the so-called aggregators. This paper introduces a novel runtime approach for determining the number of aggregator processes to be used in a collective I/O operation, depending on the file view, the process topology, the per-process write saturation point, and the actual amount of data written in a collective write operation. The algorithm is evaluated on two different file systems with multiple benchmarks. In more than 80% of the test cases, our algorithm delivered performance close to the best obtained by hand-tuning the number of aggregators for each scenario.
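A simplistic reading of two of the paper's inputs — total data volume and the per-process write saturation point — yields a back-of-the-envelope aggregator count (this formula is a plausible sketch, not the published algorithm, which also accounts for the file view and process topology):

```python
def choose_aggregators(total_bytes, num_procs, saturation_bytes):
    """Fewest aggregators such that no aggregator writes more than the
    per-process saturation point, capped at the number of processes."""
    needed = -(-total_bytes // saturation_bytes)   # ceiling division
    return max(1, min(num_procs, needed))

# 1 GiB collective write, 64 processes, 32 MiB saturation point per process:
n = choose_aggregators(1 << 30, 64, 32 << 20)
```

The intuition: below the saturation point, adding aggregators adds bandwidth; beyond it, extra aggregators only add coordination overhead, so the count should track the write volume at runtime rather than being a fixed hint.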
Citations: 32