2011 IEEE International Conference on Cluster Computing: Latest Articles

Symphony: A Scheduler for Client-Server Applications on Coprocessor-Based Heterogeneous Clusters
2011 IEEE International Conference on Cluster Computing Pub Date: 2011-09-26 DOI: 10.1109/CLUSTER.2011.46
M. M. Rafique, S. Cadambi, Kunal Rao, A. Butt, S. Chakradhar
Abstract: Coprocessors such as GPUs are increasingly being deployed in clusters to process scientific and compute-intensive jobs. In this work, we study whether GPU-based heterogeneous clusters can benefit client-server applications. Specifically, we consider the practical situation where multiple client-server applications share a heterogeneous cluster (multi-tenancy) and experience unpredictable variations in incoming client request rates, including steep load spikes. Even for "compute-intensive" client-server applications, it is unclear whether a GPU-based cluster can seamlessly deliver acceptable response times in the presence of multi-tenancy and load spikes. We argue that a cluster-level scheduler that is aware of application load, request deadlines and the heterogeneity is necessary in this situation. We propose a novel scheduler called Symphony that enables efficient, dynamic sharing of a GPU-based heterogeneous cluster across multiple concurrently executing client-server applications, each with arbitrary load spikes. Symphony performs three key tasks: it (i) monitors the load on each application, (ii) collects past performance data and dynamically builds simple performance models of available processing resources, and (iii) computes a priority for pending requests based on the above parameters and the requests' slack. Based on this, it reorders client requests across different applications to achieve acceptable response times. We also define how client-server applications should interact with a scheduler such as Symphony, and develop an API to this end. We deploy Symphony as user-space middleware on a high-end heterogeneous cluster with dual quad-core Xeon CPUs and dual NVIDIA Fermi GPUs. An evaluation using representative applications shows that in the presence of load spikes (i) Symphony incurs 2-20x fewer requests that do not meet response time constraints compared with other schedulers, and (ii) in order to achieve the same performance as Symphony, other schedulers need 2x more cluster nodes.
Citations: 6
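The Symphony abstract above mentions computing a priority from each pending request's slack and reordering requests across applications. The abstract does not give the formula, so the following is only a minimal sketch under an assumed definition: slack is the time left until the deadline minus an estimated service time, and the request with the least slack goes first. The names `ScoredRequest`, `order_by_slack`, and `est_service_time` are hypothetical and are not taken from the paper or its API.

```python
from dataclasses import dataclass


@dataclass
class ScoredRequest:
    slack: float      # assumed definition: deadline - now - estimated service time
    app: str          # application (tenant) that owns the request
    request_id: int


def order_by_slack(pending, now, est_service_time):
    """Order pending requests from all applications, least slack first.

    pending: iterable of (app, request_id, deadline) tuples.
    est_service_time: per-application service-time estimate, standing in
    for the performance models mentioned in the abstract (hypothetical).
    """
    scored = [
        ScoredRequest(deadline - now - est_service_time[app], app, request_id)
        for app, request_id, deadline in pending
    ]
    # sorted() is stable, so requests with equal slack keep arrival order.
    return sorted(scored, key=lambda r: r.slack)
```

A request with a late deadline can still be urgent if its application is slow to serve, which is why the estimate is subtracted before sorting.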
Improving PCM Endurance with Randomized Address Remapping in Hybrid Memory System
2011 IEEE International Conference on Cluster Computing Pub Date: 2011-09-26 DOI: 10.1109/CLUSTER.2011.62
Gang Wu, Jian Gao, Huxing Zhang, Yaozu Dong
Abstract: Phase-Change Memory (PCM) has emerged as a promising alternative to DRAM main memory. A new hybrid memory architecture, where DRAM serves as a cache for PCM main memory, has been proposed to leverage PCM's high scalability and DRAM's fast access time. One of the biggest issues with PCM is the limited number of writes its storage cells can sustain. We argue that a good cache mechanism will decrease PCM writes dramatically in a hybrid memory system. In this paper, we demonstrate that a traditional set-associative cache is susceptible to malicious attacks, which cause certain PCM cells to wear out through constant cache flushes. A novel approach called Randomized Address Remapping (RAR) is proposed to hide the mapping details between DRAM and PCM. With this approach, attacks based on the set-associative cache do not work, while caching remains efficient. We present Static Randomized Address Remapping (SRAR) and Dynamic Randomized Address Remapping (DRAR) in this paper. SRAR invalidates set-associative-cache-based attacks by distributing their address accesses to different sets. DRAR uses a region-based approach to change the mapping dynamically, in case the static mapping is discovered by an attacker who has compromised the operating system. Experimental results show that RAR approaches can prevent malicious attacks and improve PCM endurance greatly.
Citations: 8
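The static remapping idea in the abstract above (spreading addresses that would collide in one cache set across different sets, using a mapping the attacker cannot see) can be illustrated with a keyed bijection on block addresses. The abstract does not specify the remapping function, so this is a sketch under assumptions rather than the authors' design: the XOR, odd-multiply, and xorshift mixer, the 16-bit address width, and the 64-set cache are all hypothetical choices, and this mixer is not claimed to be cryptographically strong.

```python
import secrets

ADDR_BITS = 16                      # hypothetical block-address width
SET_BITS = 6                        # hypothetical cache with 64 sets
MASK = (1 << ADDR_BITS) - 1


def make_remapper(key=None, multiplier=None):
    """Return a secret, invertible mapping over block addresses."""
    if key is None:
        key = secrets.randbits(ADDR_BITS)
    if multiplier is None:
        multiplier = secrets.randbits(ADDR_BITS)
    multiplier |= 1                 # an odd multiplier is invertible mod 2**ADDR_BITS

    def remap(block_addr):
        x = (block_addr ^ key) & MASK       # XOR with the key: bijective
        x = (x * multiplier) & MASK         # odd multiply: bijective
        x ^= x >> (ADDR_BITS // 2)          # xorshift folds high bits into low bits
        return x

    return remap


def set_index(block_addr):
    """Plain set-associative indexing: the low SET_BITS bits pick the set."""
    return block_addr & ((1 << SET_BITS) - 1)
```

Because every step is a bijection, no two block addresses are merged, so the cache still sees each block exactly once. The final xorshift matters: without it, the low bits of the result would depend only on the low bits of the input, and addresses sharing a set would stay together.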
Performance Optimization of Data Structures Using Memory Access Characterization
2011 IEEE International Conference on Cluster Computing Pub Date: 2011-09-26 DOI: 10.1109/CLUSTER.2011.77
A. Rane, J. Browne
Abstract: Program performance optimization is generally based on measurements of execution behavior of code segments. However, an equally important task for performance optimizations is understanding memory access behaviors and thus, data structure access patterns and properties. Because memory-related problems in multi-core applications can have a significant impact on overall performance, optimizations in data access patterns will likely give a big boost to application performance. But effective diagnosis of performance bottlenecks requires that the memory measurements be related to high-level data structures (C, C++ arrays, structures, etc.). In this work, we present a low-overhead tool that captures memory traces and computes several metrics for performance characteristics of source-level data structures. Explicit consideration is given to measurement and diagnosis for multicore chips. Case studies which include (manual) use of the data structure memory access metrics to select and implement optimizations are given.
Citations: 9
Performance Characterization and Optimization of Atomic Operations on AMD GPUs
2011 IEEE International Conference on Cluster Computing Pub Date: 2011-09-26 DOI: 10.1109/CLUSTER.2011.34
M. Elteir, Heshan Lin, Wu-chun Feng
Abstract: Atomic operations are important building blocks in supporting general-purpose computing on graphics processing units (GPUs). For instance, they can be used to coordinate execution between concurrent threads, and in turn, assist in constructing complex data structures such as hash tables or implementing GPU-wide barrier synchronization. While the performance of atomic operations has improved substantially on the latest NVIDIA Fermi-based GPUs, system-provided atomic operations still incur significant performance penalties on AMD GPUs. A memory-bound kernel on an AMD GPU, for example, can suffer severe performance degradation when including an atomic operation, even if the atomic operation is never executed. In this paper, we first quantify the performance impact of atomic instructions on application kernels on AMD GPUs. We then propose a novel software-based implementation of atomic operations that can significantly improve the overall kernel performance. We evaluate its performance against the system-provided atomic operations using two micro-benchmarks and four real applications. The results show that using our software-based atomic operations on an AMD GPU can speed up an application kernel 67-fold over the same application kernel with the (default) system-provided atomic operations.
Citations: 29
Predictive and Distributed Routing Balancing for High Speed Interconnection Networks
2011 IEEE International Conference on Cluster Computing Pub Date: 2011-09-26 DOI: 10.1109/CLUSTER.2011.66
Carlos Nunez Castillo, D. Lugones, Daniel Franco, E. Luque
Abstract: Current parallel applications in parallel computing systems require an interconnection network to provide low and bounded communication delays. Communication characteristics such as traffic pattern and communication load change over time and, eventually, they may exceed available network capacity, causing congestion and performance degradation. Congestion control based on adaptive routing should be applied in order to adapt quickly to changing traffic conditions. Studies show that a vast range of parallel applications exhibit repetitive behavior and can be characterized by a set of representative phases. This work presents a Predictive and Distributed Routing Balancing technique (PR-DRB) to control network congestion based on adaptive traffic distribution. PR-DRB uses speculative routing based on application repetitiveness. PR-DRB monitors message latencies on routers and logs solutions to congestion, in order to respond quickly in similar future situations. Experimental results show that the predictive approach could be used to improve performance.
Citations: 3
Methodology for Performance Evaluation of the Input/Output System on Computer Clusters
2011 IEEE International Conference on Cluster Computing Pub Date: 2011-09-26 DOI: 10.1109/CLUSTER.2011.83
Sandra Méndez, Dolores Rexachs, E. Luque
Abstract: The increase in processing units, speed and computational power, and the complexity of scientific applications that use high performance computing require more efficient Input/Output (I/O) systems. In order to use the I/O system efficiently, it is necessary to know its performance capacity to determine whether it fulfills the application's I/O requirements. This paper proposes a methodology to evaluate I/O performance on computer clusters under different I/O configurations. This evaluation is useful to study how different I/O subsystem configurations will affect the application performance. This approach encompasses the characterization of the I/O system at three different levels: application, I/O system and I/O devices. We select different system configuration and/or I/O operation parameters and we evaluate the impact on performance by considering both the application and the I/O architecture. During I/O configuration analysis we identify configurable factors that have an impact on the performance of the I/O system. In addition, we extract information in order to select the most suitable configuration for the application.
Citations: 11
Implementation of Multigrid on QPACE
2011 IEEE International Conference on Cluster Computing Pub Date: 2011-09-26 DOI: 10.1109/CLUSTER.2011.48
M. Bolten, Daniel Brinkers, U. Rüde, M. Stürmer
Abstract: We developed and optimized a multigrid method on the QPACE cluster. The QPACE cluster is an accelerator-based cluster using the PowerXCell 8i CPU that is built by the special research field SFB TR 55 for Lattice Quantum Chromodynamics computations. The cluster uses a custom 3D torus network built using FPGAs. Our goal was to evaluate the QPACE architecture for a type of algorithm that uses a communication pattern not limited to nearest-neighbor communication. We provide a model of the communication network taking into account the specific characteristics of the network and the network processor. For the implementation we chose an accelerator-centric programming model, using only the SPUs.
Citations: 0
Improving I/O Forwarding Throughput with Data Compression
2011 IEEE International Conference on Cluster Computing Pub Date: 2011-09-26 DOI: 10.1109/CLUSTER.2011.80
Benjamin Welton, D. Kimpe, Jason Cope, C. Patrick, K. Iskra, R. Ross
Abstract: While network bandwidth is steadily increasing, it is doing so at a much slower rate than the corresponding increase in CPU performance. This trend has widened the gap between CPU and network speed. In this paper, we investigate improvements to I/O performance by exploiting this gap. We harness idle CPU resources to compress network traffic, reducing the amount of data transferred over the network and increasing effective network bandwidth. We created a set of compression services within the I/O Forwarding Scalability Layer. These services transparently compress and decompress data as it is transferred over the network. We studied the effect of the compression services on a variety of data sets and conducted experiments on a high-performance computing cluster.
Citations: 65
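The abstract above describes services that transparently compress data before it crosses the network and decompress it on the other side. As a rough illustration of that idea only, and not of the I/O Forwarding Scalability Layer's actual services or wire format, the sketch below wraps a payload in a one-byte-tagged frame using Python's standard `zlib` module and falls back to the raw bytes when compression does not help. The frame layout and the names `encode` and `decode` are assumptions made for this example.

```python
import zlib

# Hypothetical one-byte frame tags; not part of any real protocol.
RAW = b"\x00"
DEFLATE = b"\x01"


def encode(payload: bytes, level: int = 1) -> bytes:
    """Compress a payload for transfer, keeping it raw if that is smaller."""
    compressed = zlib.compress(payload, level)   # low level favours CPU speed
    if len(compressed) < len(payload):
        return DEFLATE + compressed
    return RAW + payload


def decode(frame: bytes) -> bytes:
    """Recover the original payload from a frame produced by encode()."""
    tag, body = frame[:1], frame[1:]
    if tag == DEFLATE:
        return zlib.decompress(body)
    if tag == RAW:
        return body
    raise ValueError("unknown frame tag")
```

The caller never sees whether a frame was compressed, which is what makes the scheme transparent; the raw fallback bounds the worst case at one extra byte for incompressible data.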
Performance Analysis and Benchmarking of the Intel SCC
2011 IEEE International Conference on Cluster Computing Pub Date: 2011-09-26 DOI: 10.1109/CLUSTER.2011.24
P. Gschwandtner, T. Fahringer, R. Prodan
Abstract: Over the past years there has been a steady change in CPU design towards both many-core processors and power-aware hardware architectures. These two trends are combined in the Intel Single-chip Cloud Computer (SCC), an experimental prototype with 48 Pentium cores created by Intel Labs. The SCC is a highly configurable many-core chip which provides unique opportunities to optimize run time, communication and memory access as well as power/energy consumption of parallel programs. The aim of this paper is to characterize, through benchmarking, the performance behavior of the chip with various power settings, mappings of processes/cores to memory controllers, etc. Analytical models are used to verify and interpret the results. Conclusions drawn from our benchmark outcomes are that data exchange based on message passing is faster than shared memory data exchange. Contrary to popular belief, lowest energy consumption is not achieved for the fastest execution time. Furthermore, in order to improve the memory access behavior one should increase the clock frequency of both the mesh network and the memory controllers. In general, the results of our investigations can be used to analyze the effect of power settings and architecture properties on the performance and energy consumption of parallel programs as well as assist in choosing appropriate settings for specific workloads.
Citations: 44
On Scalability for MPI Runtime Systems
2011 IEEE International Conference on Cluster Computing Pub Date: 2011-09-26 DOI: 10.1109/CLUSTER.2011.29
G. Bosilca, T. Hérault, Ala Rezmerita, Jack J. Dongarra
Abstract: The future of high performance computing, as currently foretold, will gravitate toward machines with hundreds of thousands to a million nodes, harnessing the computing power of billions of cores. While the hardware part is well covered, the software infrastructure at that scale is vague. However, no matter what the infrastructure will be, efficiently running parallel applications on such large machines will require optimized runtime environments that are scalable and resilient. More particularly, considering a future where the Message Passing Interface (MPI) remains a major programming paradigm, the MPI implementations will have to seamlessly adapt to launching and managing large-scale applications on resources several orders of magnitude larger than today. In this paper, we present a modified version of the Open MPI runtime that has been adapted towards a scalability goal. We evaluate the performance and compare it with two widely used runtime systems, the default version of Open MPI and MPICH2, using various underlying launching systems. The performance evaluation demonstrates a significant improvement over the state of the art. We also discuss the basic requirements for an exascale-ready parallel runtime.
Citations: 15