{"title":"Analyzing the Performance Bottlenecks of the POWER7-IH Network","authors":"D. Kerbyson, K. Barker","doi":"10.1109/CLUSTER.2011.35","DOIUrl":"https://doi.org/10.1109/CLUSTER.2011.35","url":null,"abstract":"In this work we provide an early performance analysis of the communication network in a small-scale POWER7-IH processing system from IBM. Using a set of communication micro-benchmarks, we quantify the achievable bandwidth of the communication links available in the system, which differ in their peak performance characteristics. We also identify the bottlenecks within the communication network and show that the bandwidth a single node can inject into the network is considerably less than the bandwidth available to the IBM hub chip, which acts as a NIC for the node as well as being an integral part of the P7-IH network. Using a communication pattern representative of many scientific applications with regular communication patterns, we show that the default task-to-core assignment on the P7-IH achieves sub-optimal performance in most cases. We also show that a diagonal-cyclic assignment, developed in this work to take into account the network topology as well as the routing strategy, can improve communication performance by up to 75%. We expect even greater improvements in communication performance on larger P7-IH systems.","PeriodicalId":200830,"journal":{"name":"2011 IEEE International Conference on Cluster Computing","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-09-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132032496","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Parallel Greedy Genetic Algorithm for Job Scheduling in Cluster Environments","authors":"Gholamali Rahnavard, Jharrod Lafon, Hadi Sharifi","doi":"10.1109/CLUSTER.2011.57","DOIUrl":"https://doi.org/10.1109/CLUSTER.2011.57","url":null,"abstract":"Many scientific researchers and applications now work with large amounts of data or rely on high-performance computing resources. High-performance clusters are built to handle massively parallel processes, and to manage resources for dynamic requests with optimal usage, the utilization rate of clusters must be maximized. In this paper we present a parallel genetic algorithm that schedules jobs for different classes of clusters. A greedy approach is used to create the initial population for the genetic algorithm. We apply the master/slave parallelization method to manage the schedulers and improve the performance of the main scheduler. An analysis of the algorithm's complexity shows that it can be more efficient than similar algorithms.","PeriodicalId":200830,"journal":{"name":"2011 IEEE International Conference on Cluster Computing","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-09-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122062670","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An ISO-Energy-Efficient Approach to Scalable System Power-Performance Optimization","authors":"S. Song, M. Grove, K. Cameron","doi":"10.1109/CLUSTER.2011.37","DOIUrl":"https://doi.org/10.1109/CLUSTER.2011.37","url":null,"abstract":"The power consumption of a large scale system ultimately limits its performance. Consuming less energy while preserving performance leads to better system utilization at scale. The iso-energy-efficiency model was proposed as a metric and methodology for explaining power and performance efficiency on scalable systems. For use in practice, we need to determine what parameters should be modified to maintain a desired efficiency. Unfortunately, without extension, the iso-energy-efficiency model cannot be used for this purpose. In this paper we extend the iso-energy-efficiency model to identify appropriate efficiency values for workload and power scaling on clusters. We propose the use of \"correlation functions\" to quantitatively explain the isolated and interacting effects of these two parameters for three representative applications: LINPACK, row-oriented matrix multiplication, and 3D Fourier transform. We show quantitatively that the iso-energy-efficiency model with correlation functions is effective at maintaining efficiency as system size scales.","PeriodicalId":200830,"journal":{"name":"2011 IEEE International Conference on Cluster Computing","volume":"68 5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-09-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122550162","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Design and Implementation of Broadcast Algorithms for Extreme-Scale Systems","authors":"Pavel Shamis, R. Graham, Manjunath Gorentla Venkata, Joshua Ladd","doi":"10.1109/CLUSTER.2011.17","DOIUrl":"https://doi.org/10.1109/CLUSTER.2011.17","url":null,"abstract":"The scalability and performance of collective communication operations limit the scalability and performance of many scientific applications. This paper presents two new blocking and nonblocking Broadcast algorithms for communicators with arbitrary communication topology, and studies their performance. These algorithms benefit from increased concurrency and a reduced memory footprint, making them suitable for use on large-scale systems. Measuring small, medium, and large data Broadcasts on a Cray-XT5 using 24,576 MPI processes, the Cheetah algorithms outperform the native MPI on that system by 51%, 69%, and 9%, respectively, at the same process count. These results demonstrate an algorithmic approach to this important class of collective communications that is high-performing, scalable, and uses resources in a scalable manner.","PeriodicalId":200830,"journal":{"name":"2011 IEEE International Conference on Cluster Computing","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-09-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130300532","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An Energy-Efficient Scheme for Cloud Resource Provisioning Based on CloudSim","authors":"Yuxiang Shi, Xiaohong Jiang, Kejiang Ye","doi":"10.1109/CLUSTER.2011.63","DOIUrl":"https://doi.org/10.1109/CLUSTER.2011.63","url":null,"abstract":"Cloud computing has recently received considerable attention. With its fast development, data centers are growing in scale and consuming more energy, so there is an urgent need for efficient energy-saving methods to reduce the huge energy consumption of cloud data centers. In this paper, we achieve this goal by dynamically allocating resources based on utilization analysis and prediction. We use a \"Linear Predicting Method\" (LPM) and a \"Flat Period Reservation-Reduced Method\" (FPRRM) to extract useful information from the resource utilization log, giving an M/M/1 queuing-theory prediction method better response time and lower energy consumption. Experimental evaluation performed on the CloudSim cloud simulator shows that the proposed methods can effectively reduce the violation rate and energy consumption in the cloud.","PeriodicalId":200830,"journal":{"name":"2011 IEEE International Conference on Cluster Computing","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-09-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127789176","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Optimizing Network I/O Virtualization with Efficient Interrupt Coalescing and Virtual Receive Side Scaling","authors":"Yaozu Dong, Dongxiao Xu, Yang Zhang, Guangdeng Liao","doi":"10.1109/CLUSTER.2011.12","DOIUrl":"https://doi.org/10.1109/CLUSTER.2011.12","url":null,"abstract":"Virtualization is a fundamental component of cloud computing because it provides numerous guest-VM-transparent services, such as live migration, high availability, rapid checkpointing, etc. However, I/O virtualization, particularly for the network, still suffers from significant performance degradation. In this paper, we analyze the performance challenges in network I/O virtualization and observe that the conventional approach incurs excessive virtual interrupts to guest VMs, and that the backend driver in the driver domain is not parallelized and cannot leverage underlying multi-core processors. Motivated by these observations, we propose two optimizations: efficient interrupt coalescing for network I/O virtualization, and virtual receive side scaling to effectively leverage multi-core processors. We implemented these optimizations in Xen and performed an extensive performance evaluation. Our experimental results reveal that the proposed optimizations significantly improve network I/O virtualization performance and effectively address the performance challenges.","PeriodicalId":200830,"journal":{"name":"2011 IEEE International Conference on Cluster Computing","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-09-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134232427","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multiphase LBM Distributed over Multiple GPUs","authors":"C. Rosales","doi":"10.1109/CLUSTER.2011.9","DOIUrl":"https://doi.org/10.1109/CLUSTER.2011.9","url":null,"abstract":"A parallel distributed CUDA implementation of a Lattice Boltzmann Method for multiphase flows with large density ratios is described in this paper. Validation runs studying the terminal velocity of a rising bubble under the effect of gravity show good agreement with the expected theoretical values. The code is benchmarked against the performance of a typical CPU implementation of the same algorithm on both AMD and Intel platforms, and a single GPU is observed to perform up to 10X faster than a quad-core CPU socket, a 40X speedup with respect to a single core. The code is shown to scale well when executed on multiple GPUs, which makes the port to CUDA valuable even when compared to parallel CPU implementations.","PeriodicalId":200830,"journal":{"name":"2011 IEEE International Conference on Cluster Computing","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-09-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114775490","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"TDP-Shell: A Generic Framework to Improve Interoperability between Batch Queue Systems and Monitoring Tools","authors":"Vicente Ivars, M. A. Senar, E. Heymann","doi":"10.1109/CLUSTER.2011.73","DOIUrl":"https://doi.org/10.1109/CLUSTER.2011.73","url":null,"abstract":"Nowadays, distributed applications, including MPI implementations, are executed on computer clusters managed by a batch queue system. Users rely on monitoring tools to detect run-time problems in applications running in those environments, but using monitoring tools on a cluster controlled by a batch queue system is a challenge: batch queue systems and monitoring tools do not coordinate the management of the resources they share when executing a distributed application. We call this problem lack of interoperability, and to solve it we have developed a framework called TDP-Shell. The framework supports different batch queue systems, such as Condor and SGE, and different monitoring tools, such as Paradyn, Gdb, and TotalView, without any changes to their source code. In this paper we describe how the basic design of TDP-Shell for sequential applications was redesigned to support the monitoring of MPI applications executed on a cluster controlled by a batch queue system.","PeriodicalId":200830,"journal":{"name":"2011 IEEE International Conference on Cluster Computing","volume":"199 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-09-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114370300","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Model-Driven Simulation to Evaluate Performance Impact of Workload Features on Parallel Systems","authors":"T. Minh, L. Wolters","doi":"10.1109/CLUSTER.2011.27","DOIUrl":"https://doi.org/10.1109/CLUSTER.2011.27","url":null,"abstract":"Parallel workloads in practice are far from randomly distributed; instead, they are highly repetitive because users tend to run the same applications over and over again. We refer to this phenomenon as temporal locality. In addition, the workloads exhibit a correlation between runtime and parallelism (i.e., number of processors), as analyzed in this paper. To the best of our knowledge, there are very few studies on the impact of these features on the performance of parallel systems. Since these impacts are not well known, researchers often evaluate scheduling algorithms with random workloads, which neglect temporal locality and the correlation. This can result in an inaccurate scheduling evaluation for parallel systems, because our study shows that these two features can significantly affect scheduling performance. In our simulation-based experiments, an increase in the correlation can quickly degrade parallel system performance and can change the outcome of comparisons between scheduling policies. With respect to temporal locality, we show that this feature does not always seriously affect schedulers of parallel systems; in particular situations, it can even help to improve scheduling performance. Furthermore, we also discuss the necessity of using workloads with these features in scheduling evaluation, as well as how to exploit the features to enhance scheduler performance.","PeriodicalId":200830,"journal":{"name":"2011 IEEE International Conference on Cluster Computing","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-09-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128100502","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"PIDX: Efficient Parallel I/O for Multi-resolution Multi-dimensional Scientific Datasets","authors":"Sidharth Kumar, V. Vishwanath, P. Carns, B. Summa, G. Scorzelli, Valerio Pascucci, R. Ross, Jacqueline H. Chen, H. Kolla, R. Grout","doi":"10.1109/CLUSTER.2011.19","DOIUrl":"https://doi.org/10.1109/CLUSTER.2011.19","url":null,"abstract":"The IDX data format provides efficient, cache-oblivious, and progressive access to large-scale scientific datasets by storing the data in a hierarchical Z (HZ) order. Data stored in IDX format can be visualized in an interactive environment, allowing for meaningful explorations with minimal resources. This technology enables real-time, interactive visualization and analysis of large datasets on a variety of systems ranging from desktop and laptop computers to portable devices such as iPhones/iPads and over the web. While the existing ViSUS API for writing IDX data is serial, there are obvious advantages to applying the IDX format to the output of large-scale scientific simulations. We have therefore developed PIDX, a parallel API for writing data in the IDX format. With PIDX it is now possible to generate IDX datasets directly from large-scale scientific simulations, with the added advantage of real-time monitoring and visualization of the generated data. In this paper, we provide an overview of the IDX file format and how it is generated using PIDX. We then present a data model description and a novel aggregation strategy to enhance the scalability of the PIDX library. The S3D combustion application is used as an example to demonstrate the efficacy of PIDX for a real-world scientific simulation. S3D is used for fundamental studies of turbulent combustion requiring exceptionally high-fidelity simulations. PIDX achieves up to 18 GiB/s I/O throughput at 8,192 processes for S3D writing data in the IDX format, which allows for interactive analysis and visualization of S3D data, thus enabling in situ analysis of the S3D simulation.","PeriodicalId":200830,"journal":{"name":"2011 IEEE International Conference on Cluster Computing","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-09-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134061859","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}