{"title":"Real Time Visualization of Monitoring Data for Large Scale HPC Systems","authors":"M. Showerman","doi":"10.1109/CLUSTER.2015.122","DOIUrl":"https://doi.org/10.1109/CLUSTER.2015.122","url":null,"abstract":"High Performance Computing (HPC) system users and administrators are often hampered in their ability understand application performance and system behavior due to a lack of sufficient information about how resources, such as memory, CPU, networks and filesystems are being used. While obtaining the related data is a necessary step, it is insufficient without tools that can turn the data into actionable information. Required capabilities of such tools are the ability to efficiently handle vast amounts of data in a timely fashion, the presentation of effective and understandable information representations for large node counts, and the correlation of that data with job and system events. This paper presents visualization approaches and tools that NCSA is developing, combined with the use of freely available web interfaces, to turn the eight billion platform related data points per day being collected from their 27,648 compute node Blue Waters platform into actionable information for both system administrators and users. Insights from the visualizations both at the system and the job levels are also presented.","PeriodicalId":187042,"journal":{"name":"2015 IEEE International Conference on Cluster Computing","volume":"296 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124246783","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Exploiting Spatial Information in Datasets to Enable Fault Tolerant Sparse Matrix Solvers","authors":"Rob Hunt, Simon McIntosh-Smith","doi":"10.1109/CLUSTER.2015.102","DOIUrl":"https://doi.org/10.1109/CLUSTER.2015.102","url":null,"abstract":"High-performance computing (HPC) systems continue to increase in size in the quest for ever higher performance. The resulting increased electronic component count, coupled with the decrease in feature sizes of the silicon manufacturing processes used to build these components, will result in future Exascale systems being more susceptible to soft errors caused by cosmic radiation than current HPC systems. Through the use of techniques such as hardware-based error-correcting codes (ECC) and checkpoint-restart, many of these faults can be mitigated, but at the cost of increased hardware overhead, run-time, and energy consumption that can be as much as 10 - 20%. For extreme scale systems, these overheads will represent megawatts of power consumption and millions of dollars of additional hardware cost, which could potentially be avoided with more sophisticated fault-tolerance techniques. In this paper we present a new software-based fault tolerance technique that can be applied to one of the most important classes of software in HPC: sparse matrix solvers. Our new technique enables us to exploit knowledge of the structure of sparse matrices in such a way as to improve the performance, energy efficiency and fault tolerance of the overall solution.","PeriodicalId":187042,"journal":{"name":"2015 IEEE International Conference on Cluster Computing","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117351811","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Introducing and Exploiting Hierarchical Structural Information","authors":"D. Bonilla, C. W. Glass, J. Kuper, R. D. Groote","doi":"10.1109/CLUSTER.2015.133","DOIUrl":"https://doi.org/10.1109/CLUSTER.2015.133","url":null,"abstract":"This paper presents a programming model approach that explicitly addresses the programmability of scientific code by annotating imperative code with its algorithmic structural behavior. This information is used to create hierarchical structures, as opposed to the flat structure that most programming models work with, which allows sound code transformation at any level of the code, adjusting the granularity of parallelization simultaneously with other parameters to better exploit the available hardware resources.","PeriodicalId":187042,"journal":{"name":"2015 IEEE International Conference on Cluster Computing","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122032397","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"On the Execution of Computationally Intensive CPU-Based Libraries on Remote Accelerators for Increasing Performance: Early Experience with the OpenBLAS and FFTW Libraries","authors":"S. Valero, F. Silla","doi":"10.1109/CLUSTER.2015.111","DOIUrl":"https://doi.org/10.1109/CLUSTER.2015.111","url":null,"abstract":"Virtualization techniques have shown to report benefits to data centers and other computing facilities. In this regard, virtual machines not only allow reducing the size of the computing infrastructure while increasing overall resource utilization but virtualizing individual components of computers may also provide significant benefits. This is the case, for example, for the remote GPU virtualization technique, implemented in several frameworks during the last years. In this paper we present an initial implementation of a new middleware for the remote virtualization of another component of computers: the CPU itself. Our proposal uses remote accelerators to perform computations that were initially intended to be carried out in the local CPUs, doing so transparently to the application and without having to modify its source code. By making use of the OpenBLAS and FFTW libraries as case studies to show the performance gains of our proposal, we carry out a performance evaluation targeting several system configurations comprising Xeon processors as well as Ethernet and InfiniBand QDR, FDR, and EDR network adapters in addition to NVIDIA Tesla K40 GPUs. Results not only demonstrate that the new middleware is feasible, but they also show that mathematical libraries may experience a significant speed up, despite of having to move data forth and back to/from remote servers.","PeriodicalId":187042,"journal":{"name":"2015 IEEE International Conference on Cluster Computing","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116996842","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"PLB-HeC: A Profile-Based Load-Balancing Algorithm for Heterogeneous CPU-GPU Clusters","authors":"Luis Sant'Ana, Daniel Cordeiro, R. Camargo","doi":"10.1109/CLUSTER.2015.24","DOIUrl":"https://doi.org/10.1109/CLUSTER.2015.24","url":null,"abstract":"The use of GPU clusters for scientific applications in areas such as physics, chemistry and bioinformatics is becoming more widespread. These clusters frequently have different types of processing devices, such as CPUs and GPUs, which can themselves be heterogeneous. To use these devices in an efficient manner, it is crucial to find the right amount of work for each processor that balances the computational load among them. This problem is not only NP-hard on its essence, but also tricky due to the variety of architectures of those devices. We present PLB-HeC, a Profile-based Load-Balancing algorithm for Heterogeneous CPU-GPU Clusters that performs an online estimation of performance curve models for each GPU and CPU processor. Its main difference to existing algorithms is the generation of a non-linear system of equations representing the models and its solution using a interior point method, improving the accuracy of block distribution among processing units. We implemented the algorithm in the StarPU framework and compared its performance with existing load-balancing algorithms using applications from linear algebra, stock markets and bioinformatics. We show that it reduces the application execution times in almost all scenarios, when using heterogeneous clusters with two or more machine configurations.","PeriodicalId":187042,"journal":{"name":"2015 IEEE International Conference on Cluster Computing","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127127463","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Empirical Comparison of Three Versioning Architectures","authors":"H. Fujita, K. Iskra, P. Balaji, A. Chien","doi":"10.1109/CLUSTER.2015.69","DOIUrl":"https://doi.org/10.1109/CLUSTER.2015.69","url":null,"abstract":"Future supercomputer systems will face serious reliability challenges. Among failure scenarios, latent errors are some of the most serious and concerning. Preserving multiple versions of critical data is a promising approach to deal with such errors. We are developing the Global View Resilience (GVR) library, with multi-version global arrays as one of the key features. This paper presents three array versioning architectures: flat array, flat array with change tracking, and log-structured array. We use a synthetic workload comparing the three array architectures in terms of runtime performance and memory requirements. The experiments show that the flat array with change tracking is the best architecture in terms of runtime performance, for versioning frequencies of 10-5 ops-1 or higher matching the second best architecture or beating it by over 8 times, whereas the log-structured array is preferable for low memory usage, since it saves up to 88% of memory compared with a flat array.","PeriodicalId":187042,"journal":{"name":"2015 IEEE International Conference on Cluster Computing","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132861121","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Evaluation of Parallel Communication Models in Nekbone, a Nek5000 Mini-Application","authors":"I. Ivanov, Jing Gong, D. Akhmetova, I. Peng, S. Markidis, E. Laure, Rui Machado, M. Rahn, Valeria Bartsch, A. Hart, P. Fischer","doi":"10.1109/CLUSTER.2015.131","DOIUrl":"https://doi.org/10.1109/CLUSTER.2015.131","url":null,"abstract":"Nekbone is a proxy application of Nek5000, a scalable Computational Fluid Dynamics (CFD) code used for modelling incompressible flows. The Nekbone mini-application is used by several international co-design centers to explore new concepts in computer science and to evaluate their performance. We present the design and implementation of a new communication kernel in the Nekbone mini-application with the goal of studying the performance of different parallel communication models. First, a new MPI blocking communication kernel has been developed to solve Nekbone problems in a three-dimensional Cartesian mesh and process topology. The new MPI implementation delivers a 13% performance improvement compared to the original implementation. The new MPI communication kernel consists of approximately 500 lines of code against the original 7,000 lines of code, allowing experimentation with new approaches in Nekbone parallel communication. Second, the MPI blocking communication in the new kernel was changed to the MPI non-blocking communication. Third, we developed a new Partitioned Global Address Space (PGAS) communication kernel, based on the GPI-2 library. This approach reduces the synchronization among neighbor processes and is on average 3% faster than the new MPI-based, non-blocking, approach. In our tests on 8,192 processes, the GPI-2 communication kernel is 3% faster than the new MPI non-blocking communication kernel. In addition, we have used the OpenMP in all the versions of the new communication kernel. Finally, we highlight the future steps for using the new communication kernel in the parent application Nek5000.","PeriodicalId":187042,"journal":{"name":"2015 IEEE International Conference on Cluster Computing","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130957759","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Evolution of Monitoring over the Lifetime of a High Performance Computing Cluster","authors":"A. DeConinck, K. Kelly","doi":"10.1109/CLUSTER.2015.123","DOIUrl":"https://doi.org/10.1109/CLUSTER.2015.123","url":null,"abstract":"High Performance Computer (HPC) systems typically have lifetimes of four to six years. During this lifetime a system will undergo substantial changes in the system software stack and hardware configuration. Simultaneously, the physical environment around it will change as old systems are retired and new systems are brought in. This report focuses on our experience with Mustang, a 1600 node Linux cluster at LANL. Over the three years we have operated Mustang, the machine and environment have changed substantially, which has resulted in reliability and stability issues on the cluster. In this report we present our experiences with standard monitoring and analysis tools available on Mustang since its installation, and how recent advances in our tools and usage have improved our ability to troubleshoot these issues and perform timely root cause analysis. These advances have both improved our management of existing installations as well as informed our hardware and tooling requirements for future systems.","PeriodicalId":187042,"journal":{"name":"2015 IEEE International Conference on Cluster Computing","volume":"115 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123465324","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Exploiting GPUDirect RDMA in Designing High Performance OpenSHMEM for NVIDIA GPU Clusters","authors":"Khaled Hamidouche, Akshay Venkatesh, A. Awan, H. Subramoni, Ching-Hsiang Chu, D. Panda","doi":"10.1109/CLUSTER.2015.21","DOIUrl":"https://doi.org/10.1109/CLUSTER.2015.21","url":null,"abstract":"GPUDirect RDMA (GDR) brings the high-performance communication capabilities of RDMA networks like InfiniBand (IB) to GPUs (referred to as \"Device\"). It enables IB network adapters to directly write/read data to/from GPU memory. Partitioned Global Address Space (PGAS) programming models, such as OpenSHMEM, provide an attractive approach for developing scientific applications with irregular communication characteristics by providing shared memory address space abstractions, along with one-sided communication semantics. However, current approaches and designs of OpenSHMEM on GPU clusters do not take advantage of the GDR features leading to inefficiencies and sub-optimal performance. In this paper, we analyze the performance of various OpenSHMEM operations with different inter-node and intra-node communication configurations (Host-to-Device, Device-to-Device, and Device-to-Host) on GPU based systems. We propose novel designs that ensure \"truly one-sided\" communication for the different inter-/intra-node configurations identified above while working around the hardware limitations. To the best of our knowledge, this is the first work that investigates GDR-aware designs for OpenSHMEM communication operations. Experimental evaluations indicate 2.5X and 7X improvement in point-point communication for intra-node and inter-node, respectively. The proposed framework achieves 2.2μs for an intra-node 8 byte put operation from Host-to-Device, and 3.13μs for an inter-node 8 byte put operation from GPU to remote GPU. With Stencil2D application kernel from SHOC benchmark suite, we observe a 19% reduction in execution time on 64 GPU nodes. Further, for GPULBM application, we are able to improve the performance of the evolution phase by 53% and 45% on 32 and 64 GPU nodes, respectively.","PeriodicalId":187042,"journal":{"name":"2015 IEEE International Conference on Cluster Computing","volume":"97 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126056270","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An Approach to Selecting Thread + Process Mixes for Hybrid MPI + OpenMP Applications","authors":"Hormozd Gahvari, M. Schulz, U. Yang","doi":"10.1109/CLUSTER.2015.64","DOIUrl":"https://doi.org/10.1109/CLUSTER.2015.64","url":null,"abstract":"Hybrid MPI + OpenMP is a popular means of programming modern machines that feature substantial parallelism both off-node and on-node. Determining the right mix of the two programming models to use, however, is not as straightforward as simply using exclusively OpenMP on-node and limiting MPI to only inter-node communication. We present a step-by-step methodology to help make the decision of which mix of the two programming models to use. It starts with an estimate of the performance of a generic hybrid application on a given machine and incorporates additional available information about the specific application and the machine to provide guidance for selecting effective mixes of MPI processes and OpenMP threads to use when running that application on the machine in question. We validate our approach on four different applications on an IBM Blue Gene/Q, a Cray XK7, and a Cray XC30.","PeriodicalId":187042,"journal":{"name":"2015 IEEE International Conference on Cluster Computing","volume":"53 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114214991","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}