2011 IEEE International Conference on Cluster Computing: Latest Papers

Symphony: A Scheduler for Client-Server Applications on Coprocessor-Based Heterogeneous Clusters

2011 IEEE International Conference on Cluster Computing Pub Date : 2011-09-26 DOI: 10.1109/CLUSTER.2011.46

M. M. Rafique, S. Cadambi, Kunal Rao, A. Butt, S. Chakradhar

Abstract: Coprocessors such as GPUs are increasingly being deployed in clusters to process scientific and compute-intensive jobs. In this work, we study whether GPU-based heterogeneous clusters can benefit client-server applications. Specifically, we consider the practical situation where multiple client-server applications share a heterogeneous cluster (multi-tenancy) and experience unpredictable variations in incoming client request rates, including steep load spikes. Even for "compute-intensive" client-server applications, it is unclear whether a GPU-based cluster can seamlessly deliver acceptable response times in the presence of multi-tenancy and load spikes. We argue that a cluster-level scheduler that is aware of application load, request deadlines, and heterogeneity is necessary in this situation. We propose a novel scheduler called Symphony that enables efficient, dynamic sharing of a GPU-based heterogeneous cluster across multiple concurrently executing client-server applications, each with arbitrary load spikes. Symphony performs three key tasks: it (i) monitors the load on each application, (ii) collects past performance data and dynamically builds simple performance models of available processing resources, and (iii) computes a priority for pending requests based on the above parameters and the requests' slack. Based on this, it reorders client requests across different applications to achieve acceptable response times. We also define how client-server applications should interact with a scheduler such as Symphony, and develop an API to this end. We deploy Symphony as user-space middleware on a high-end heterogeneous cluster with dual quad-core Xeon CPUs and dual NVIDIA Fermi GPUs. An evaluation using representative applications shows that in the presence of load spikes (i) Symphony incurs 2-20x fewer requests that miss their response time constraints than other schedulers, and (ii) other schedulers need 2x more cluster nodes to match Symphony's performance.
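The abstract describes priorities derived from past performance data and request slack but does not give the formula. The following is only a minimal sketch of one plausible reading, a least-slack-first ordering, and is not Symphony's actual algorithm or API. The names `Request`, `estimate_service_time`, `slack`, and `reorder`, the notion of abstract "work units", and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Request:
    app: str         # application (tenant) that issued the request
    deadline: float  # absolute time by which a response is due, in seconds
    work: float      # hypothetical measure of how much work the request needs


def estimate_service_time(samples, work):
    """Toy performance model: mean observed seconds per work unit, scaled by work.

    `samples` is a list of (seconds_taken, work_units) pairs from past requests.
    """
    seconds_per_unit = sum(t / w for t, w in samples) / len(samples)
    return seconds_per_unit * work


def slack(req, now, history):
    """Time left before the deadline after subtracting the predicted service time."""
    return req.deadline - now - estimate_service_time(history[req.app], req.work)


def reorder(pending, now, history):
    """Order pending requests across applications, least slack first."""
    return sorted(pending, key=lambda r: slack(r, now, history))


# Hypothetical past observations per application: (seconds, work units).
history = {"search": [(2.0, 1.0), (4.0, 2.0)], "encode": [(1.0, 1.0)]}
pending = [
    Request("search", deadline=10.0, work=3.0),  # predicted 6 s, slack 4 s
    Request("encode", deadline=5.0, work=2.0),   # predicted 2 s, slack 3 s
    Request("search", deadline=20.0, work=1.0),  # predicted 2 s, slack 18 s
]
for req in reorder(pending, now=0.0, history=history):
    print(req.app, req.deadline, slack(req, 0.0, history))
```

The sketch ignores per-application load monitoring and resource heterogeneity, both of which the abstract lists as inputs to the real priority computation.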

Citations: 6

Improving PCM Endurance with Randomized Address Remapping in Hybrid Memory System

2011 IEEE International Conference on Cluster Computing Pub Date : 2011-09-26 DOI: 10.1109/CLUSTER.2011.62

Gang Wu, Jian Gao, Huxing Zhang, Yaozu Dong

Abstract: Phase-Change Memory (PCM) has emerged as a promising alternative to DRAM main memory. A new hybrid memory architecture, in which DRAM serves as a cache for PCM main memory, has been proposed to leverage PCM's high scalability and DRAM's fast access time. One of the biggest issues with PCM is the limited number of writes its storage cells can sustain. We argue that a good cache mechanism will decrease PCM writes dramatically in a hybrid memory system. In this paper, we demonstrate that a traditional set-associative cache is susceptible to malicious attacks, which cause certain PCM cells to wear out through constant cache flushes. A novel approach called Randomized Address Remapping (RAR) is proposed to hide the mapping details between DRAM and PCM. With this approach, attacks based on the set-associative cache do not work, while caching remains efficient. We present Static Randomized Address Remapping (SRAR) and Dynamic Randomized Address Remapping (DRAR) in this paper. SRAR invalidates set-associative-cache-based attacks by distributing their address accesses to different sets. DRAR uses a region-based approach to change the mapping dynamically, in case the static mapping is discovered by an attacker who has compromised the operating system. Experimental results show that RAR approaches can prevent malicious attacks and greatly improve PCM endurance.
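To illustrate the idea behind static remapping, the sketch below contrasts conventional set indexing with an index derived from a secret. The abstract does not specify the remapping function, so the keyed BLAKE2b hash used here is an assumption chosen for illustration, as are the cache geometry (`LINE_SIZE`, `NUM_SETS`), the function names, and the secret value.

```python
import hashlib

LINE_SIZE = 64   # bytes per cache line (hypothetical geometry)
NUM_SETS = 256   # number of sets in the DRAM cache (hypothetical geometry)


def conventional_set(addr: int) -> int:
    """Conventional indexing: the set is taken directly from the address bits."""
    return (addr // LINE_SIZE) % NUM_SETS


def randomized_set(addr: int, secret: bytes) -> int:
    """Assumed remapping: derive the set from a keyed hash of the line number.

    Without the secret, an attacker cannot tell which addresses share a set.
    """
    line = addr // LINE_SIZE
    digest = hashlib.blake2b(
        line.to_bytes(8, "little"), digest_size=8, key=secret
    ).digest()
    return int.from_bytes(digest, "little") % NUM_SETS


# Addresses an attacker would choose to hammer set 0 of a conventional cache.
attack = [i * LINE_SIZE * NUM_SETS for i in range(64)]
secret = b"hypothetical-boot-time-secret"

print(len({conventional_set(a) for a in attack}))        # every address hits one set
print(len({randomized_set(a, secret) for a in attack}))  # accesses spread over many sets
```

A dynamic variant in the spirit of DRAR could periodically replace the secret for a region of memory, at the cost of migrating the lines cached under the old mapping; that step is not shown.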

Citations: 8

Performance Optimization of Data Structures Using Memory Access Characterization

2011 IEEE International Conference on Cluster Computing Pub Date : 2011-09-26 DOI: 10.1109/CLUSTER.2011.77

A. Rane, J. Browne

Citations: 9

Performance Characterization and Optimization of Atomic Operations on AMD GPUs

2011 IEEE International Conference on Cluster Computing Pub Date : 2011-09-26 DOI: 10.1109/CLUSTER.2011.34

M. Elteir, Heshan Lin, Wu-chun Feng

Abstract: Atomic operations are important building blocks in supporting general-purpose computing on graphics processing units (GPUs). For instance, they can be used to coordinate execution between concurrent threads and, in turn, assist in constructing complex data structures such as hash tables or implementing GPU-wide barrier synchronization. While the performance of atomic operations has improved substantially on the latest NVIDIA Fermi-based GPUs, system-provided atomic operations still incur significant performance penalties on AMD GPUs. A memory-bound kernel on an AMD GPU, for example, can suffer severe performance degradation when it includes an atomic operation, even if the atomic operation is never executed. In this paper, we first quantify the performance impact of atomic instructions on application kernels on AMD GPUs. We then propose a novel software-based implementation of atomic operations that can significantly improve overall kernel performance. We evaluate its performance against the system-provided atomic operations using two micro-benchmarks and four real applications. The results show that using our software-based atomic operations on an AMD GPU can speed up an application kernel 67-fold over the same kernel using the (default) system-provided atomic operations.

Citations: 29

Predictive and Distributed Routing Balancing for High Speed Interconnection Networks

2011 IEEE International Conference on Cluster Computing Pub Date : 2011-09-26 DOI: 10.1109/CLUSTER.2011.66

Carlos Nunez Castillo, D. Lugones, Daniel Franco, E. Luque

Citations: 3

Methodology for Performance Evaluation of the Input/Output System on Computer Clusters

2011 IEEE International Conference on Cluster Computing Pub Date : 2011-09-26 DOI: 10.1109/CLUSTER.2011.83

Sandra Méndez, Dolores Rexachs, E. Luque

Citations: 11

Implementation of Multigrid on QPACE

2011 IEEE International Conference on Cluster Computing Pub Date : 2011-09-26 DOI: 10.1109/CLUSTER.2011.48

M. Bolten, Daniel Brinkers, U. Rüde, M. Stürmer

Citations: 0

Improving I/O Forwarding Throughput with Data Compression

2011 IEEE International Conference on Cluster Computing Pub Date : 2011-09-26 DOI: 10.1109/CLUSTER.2011.80

Benjamin Welton, D. Kimpe, Jason Cope, C. Patrick, K. Iskra, R. Ross

Citations: 65

Performance Analysis and Benchmarking of the Intel SCC

2011 IEEE International Conference on Cluster Computing Pub Date : 2011-09-26 DOI: 10.1109/CLUSTER.2011.24

P. Gschwandtner, T. Fahringer, R. Prodan

Abstract: Over the past years there has been a steady change in CPU design towards both many-core processors and power-aware hardware architectures. These two trends are combined in the Intel Single-chip Cloud Computer (SCC), an experimental prototype with 48 Pentium cores created by Intel Labs. The SCC is a highly configurable many-core chip that provides unique opportunities to optimize the run time, communication, and memory access as well as the power/energy consumption of parallel programs. The aim of this paper is to characterize, through benchmarking, the performance behavior of the chip under various power settings, mappings of processes/cores to memory controllers, etc. Analytical models are used to verify and interpret the results. One conclusion drawn from our benchmark outcomes is that data exchange based on message passing is faster than shared-memory data exchange. Contrary to popular belief, the lowest energy consumption is not achieved at the fastest execution time. Furthermore, in order to improve memory access behavior, one should increase the clock frequency of both the mesh network and the memory controllers. In general, the results of our investigations can be used to analyze the effect of power settings and architecture properties on the performance and energy consumption of parallel programs, and to assist in choosing appropriate settings for specific workloads.
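The observation that the fastest run is not the most energy-efficient one can be illustrated with a textbook power model. This is not the paper's analytical model and uses no measured SCC data. The sketch assumes power = static power + c · V² · f, run time = cycles / f, and energy = power × time. Every constant and every frequency/voltage pair below is hypothetical.

```python
STATIC_WATTS = 10.0  # hypothetical constant (leakage and uncore) power
C_DYNAMIC = 50.0     # hypothetical dynamic-power coefficient, W / (GHz * V^2)
GIGACYCLES = 1.0     # hypothetical workload size, in 10^9 cycles

# Hypothetical (frequency in GHz, voltage in V) operating points.
SETTINGS = [(0.2, 0.7), (0.4, 0.8), (0.8, 1.1)]


def run_time(freq_ghz: float) -> float:
    """Seconds needed to execute the workload at the given frequency."""
    return GIGACYCLES / freq_ghz


def energy(freq_ghz: float, volts: float) -> float:
    """Joules consumed: (static + dynamic power) multiplied by run time."""
    power = STATIC_WATTS + C_DYNAMIC * volts ** 2 * freq_ghz
    return power * run_time(freq_ghz)


fastest = min(SETTINGS, key=lambda s: run_time(s[0]))
most_efficient = min(SETTINGS, key=lambda s: energy(*s))

for f, v in SETTINGS:
    print(f"{f} GHz @ {v} V: {run_time(f):.2f} s, {energy(f, v):.1f} J")
print("fastest:", fastest, "lowest energy:", most_efficient)
```

Under these assumed numbers the middle setting uses the least energy: the slowest setting pays static power for longer, while the fastest pays a V² penalty for its higher voltage.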

Citations: 44

On Scalability for MPI Runtime Systems

2011 IEEE International Conference on Cluster Computing Pub Date : 2011-09-26 DOI: 10.1109/CLUSTER.2011.29

G. Bosilca, T. Hérault, Ala Rezmerita, Jack J. Dongarra

Abstract: The future of high performance computing, as currently foretold, will gravitate toward machines with hundreds of thousands to millions of nodes, harnessing the computing power of billions of cores. While the hardware part is well covered, the software infrastructure at that scale is vague. However, no matter what the infrastructure will be, efficiently running parallel applications on such large machines will require optimized runtime environments that are scalable and resilient. More particularly, considering a future where the Message Passing Interface (MPI) remains a major programming paradigm, MPI implementations will have to seamlessly adapt to launching and managing large-scale applications on resources several orders of magnitude larger than today's. In this paper, we present a modified version of the Open MPI runtime that has been adapted towards a scalability goal. We evaluate its performance and compare it with two widely used runtime systems, the default version of Open MPI and MPICH2, using various underlying launching systems. The performance evaluation demonstrates a significant improvement over the state of the art. We also discuss the basic requirements for an exascale-ready parallel runtime.

Citations: 15