{"title":"Large Scale Frequent Pattern Mining Using MPI One-Sided Model","authors":"Abhinav Vishnu, Khushbu Agarwal","doi":"10.1109/CLUSTER.2015.30","DOIUrl":"https://doi.org/10.1109/CLUSTER.2015.30","url":null,"abstract":"In this paper, we propose a work-stealing runtime - Library for Work Stealing (LibWS) - using the MPI one-sided model for designing a scalable FP-Growth - the de facto frequent pattern mining algorithm - on large-scale systems. LibWS provides locality-efficient and highly scalable work-stealing techniques for load balancing on a variety of data distributions. We also propose a novel communication algorithm for the FP-Growth data exchange phase, which reduces the communication complexity from the state-of-the-art θ(p) to θ(f + p/f), for p processes and f frequent attribute-ids. FP-Growth is implemented using LibWS and evaluated on several work distributions and support counts. An experimental evaluation of FP-Growth on LibWS using 4096 processes on an InfiniBand cluster demonstrates excellent efficiency for several work distributions (91% efficiency for Power-law and 93% for Poisson). The proposed distributed FP-Tree merging algorithm provides a 38x communication speedup on 4096 cores.","PeriodicalId":187042,"journal":{"name":"2015 IEEE International Conference on Cluster Computing","volume":"11218 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132775416","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
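A back-of-the-envelope sketch of the communication-cost claim in the abstract above: the proposed exchange costs θ(f + p/f) versus the state-of-the-art θ(p). The concrete values (p = 4096 processes, f = 64 attribute-ids) are illustrative assumptions, not figures from the paper; they happen to give a ratio in the same ballpark as the reported 38x speedup.

```python
# Compare the two communication complexities from the abstract:
# naive all-to-all exchange, Theta(p), versus the grouped exchange,
# Theta(f + p/f), for p processes and f frequent attribute-ids.

def exchange_cost_naive(p: int) -> float:
    """Each process communicates with all p peers: Theta(p)."""
    return float(p)

def exchange_cost_grouped(p: int, f: int) -> float:
    """Grouped exchange over f attribute-ids: Theta(f + p/f)."""
    return f + p / f

p, f = 4096, 64  # illustrative assumption; f + p/f is minimized near f = sqrt(p)
naive = exchange_cost_naive(p)
grouped = exchange_cost_grouped(p, f)
print(f"naive: {naive:.0f}, grouped: {grouped:.0f}, ratio: {naive / grouped:.0f}x")
```

Note that for fixed p the grouped cost f + p/f is minimized at f = sqrt(p), which is why the scheme pays off most when the number of frequent attribute-ids is moderate relative to the process count.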
{"title":"Exploring the Suitability of Remote GPGPU Virtualization for the OpenACC Programming Model Using rCUDA","authors":"Adrián Castelló, Antonio J. Peña, R. Mayo, P. Balaji, E. S. Quintana‐Ortí","doi":"10.1109/CLUSTER.2015.23","DOIUrl":"https://doi.org/10.1109/CLUSTER.2015.23","url":null,"abstract":"OpenACC is an application programming interface (API) that aims to unleash the power of heterogeneous systems composed of CPUs and accelerators such as graphics processing units (GPUs) or Intel Xeon Phi coprocessors. This directive-based programming model is intended to enable developers to accelerate their applications' execution with much less effort. Coprocessors offer significant computing power, but in many cases these devices remain largely underused because not all parts of applications match the accelerator architecture. Remote accelerator virtualization frameworks introduce a means to address this problem. In particular, the remote CUDA virtualization middleware rCUDA provides transparent remote access to any GPU installed in a cluster. Combining these two technologies, OpenACC and rCUDA, in a single scenario is naturally appealing. In this work we explore how the different OpenACC directives behave on top of a remote GPGPU virtualization technology in two different hardware configurations. Our experimental evaluation reveals favorable performance results when the two technologies are combined, showing low overhead and similar scaling factors when executing OpenACC-enabled directives.","PeriodicalId":187042,"journal":{"name":"2015 IEEE International Conference on Cluster Computing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129379015","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GO-Docker: A Batch Scheduling System with Docker Containers","authors":"Olivier Sallou, Cyril Monjeaud","doi":"10.1109/CLUSTER.2015.89","DOIUrl":"https://doi.org/10.1109/CLUSTER.2015.89","url":null,"abstract":"A multi-user, open-source batch scheduling system based on Docker containers, with custom scheduler and executor plugin mechanisms.","PeriodicalId":187042,"journal":{"name":"2015 IEEE International Conference on Cluster Computing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129251221","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Optimizing Explicit Hydrodynamics for Power, Energy, and Performance","authors":"E. León, I. Karlin, Ryan E. Grant","doi":"10.1109/CLUSTER.2015.12","DOIUrl":"https://doi.org/10.1109/CLUSTER.2015.12","url":null,"abstract":"Practical considerations for future supercomputer designs will impose limits on both instantaneous power consumption and total energy consumption. Working within these constraints while providing the maximum possible performance, application developers will need to optimize their code for speed alongside power and energy concerns. This paper analyzes the effectiveness of several code optimizations, including loop fusion, data structure transformations, and global allocations. A per-component measurement and analysis of different architectures is performed, enabling the examination of code optimizations on different compute subsystems. Using an explicit hydrodynamics proxy application from the U.S. Department of Energy, LULESH, we show how code optimizations impact different computational phases of the simulation. This provides insight for simulation developers into the best optimizations to use during particular simulation compute phases when optimizing code for future supercomputing platforms. We examine and contrast both x86 and Blue Gene architectures with respect to these optimizations.","PeriodicalId":187042,"journal":{"name":"2015 IEEE International Conference on Cluster Computing","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124124513","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Optimizing I/O for Petascale Seismic Simulations on Unstructured Meshes","authors":"Sebastian Rettenberger, M. Bader","doi":"10.1109/CLUSTER.2015.51","DOIUrl":"https://doi.org/10.1109/CLUSTER.2015.51","url":null,"abstract":"SeisSol simulates earthquake dynamics by coupling seismic wave propagation and dynamic rupture simulations with high-order accuracy on fully adaptive, unstructured meshes. In this paper we present an optimization of SeisSol's I/O implementations to establish a workflow that supports petascale simulations on large unstructured datasets. Our implementations can handle meshes with more than 1 billion cells and 660 billion degrees of freedom. The results show that SeisSol can initialize the mesh structure within 35 seconds on 2048 SuperMUC nodes from our new optimized mesh format. For the wave field output we implemented carefully tuned I/O routines based on HDF5 and MPI-IO. With an aggregation strategy we are able to increase the write bandwidth from 832 MiB/s to 6.7 GiB/s on 2048 SuperMUC nodes.","PeriodicalId":187042,"journal":{"name":"2015 IEEE International Conference on Cluster Computing","volume":"273 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121360562","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Scalable Relativistic High-Resolution Shock-Capturing for Heterogeneous Computing","authors":"F. Glines, Matthew Anderson, D. Neilsen","doi":"10.1109/CLUSTER.2015.110","DOIUrl":"https://doi.org/10.1109/CLUSTER.2015.110","url":null,"abstract":"A shift is underway in high performance computing (HPC) towards heterogeneous parallel architectures that emphasize medium- and fine-grain thread parallelism. Many scientific computing algorithms, including simple finite-differencing methods, have already been mapped to heterogeneous architectures with order-of-magnitude gains in performance as a result. Recent case studies examining high-resolution shock-capturing (HRSC) algorithms suggest that these finite-volume methods are good candidates for emerging heterogeneous architectures. HRSC methods form a key scientific kernel for compressible inviscid solvers that appear in astrophysics and engineering applications and tend to require enormous memory and computing resources. This work presents a case study of an HRSC method executed on a heterogeneous parallel architecture utilizing hundreds of GPU-enabled nodes with remote direct memory access to the GPUs for a non-trivial shock application using the relativistic magnetohydrodynamics model.","PeriodicalId":187042,"journal":{"name":"2015 IEEE International Conference on Cluster Computing","volume":"89 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126223668","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Enabling Tractable Exploration of the Performance of Adaptive Mesh Refinement","authors":"C. Vaughan, R. Barrett","doi":"10.1109/CLUSTER.2015.129","DOIUrl":"https://doi.org/10.1109/CLUSTER.2015.129","url":null,"abstract":"A broad range of physical phenomena in science and engineering can be explored using finite difference and volume based application codes. Incorporating Adaptive Mesh Refinement (AMR) into these codes focuses attention on the most critical parts of a simulation, enabling increased numerical accuracy of the solution while limiting memory consumption. However, adaptivity comes at the cost of increased runtime complexity, which is particularly challenging on emerging and expected future architectures. In order to explore the design space offered by new computing environments, we have developed a proxy application called miniAMR. MiniAMR exposes a range of the important issues that will significantly impact the performance potential of full application codes. In this paper, we describe miniAMR, demonstrate what it is designed to represent in a full application code, and illustrate how it can be used to exploit future high performance computing architectures. To ensure an accurate understanding of what miniAMR is intended to represent, we compare it with CTH, a shock hydrodynamics code in heavy use throughout several computational science and engineering communities.","PeriodicalId":187042,"journal":{"name":"2015 IEEE International Conference on Cluster Computing","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126433867","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dynamic Model-Driven Parallel I/O Performance Tuning","authors":"Babak Behzad, S. Byna, Stefan M. Wild, Prabhat, M. Snir","doi":"10.1109/CLUSTER.2015.37","DOIUrl":"https://doi.org/10.1109/CLUSTER.2015.37","url":null,"abstract":"Parallel I/O performance depends highly on the interactions among multiple layers of the parallel I/O stack. The most common layers include high-level I/O libraries, MPI-IO middleware, and the parallel file system. Each of these layers offers various tunable parameters to control intermediary data transfer points and the final data layout. Due to the interdependencies and the number of combinations of parameters, finding a good set of parameter values for a specific application's I/O pattern is challenging. Recent efforts, such as autotuning with genetic algorithms (GAs) and analytical models, have several limitations. For instance, analytical models fail to capture the dynamic nature of shared supercomputing systems and are application-specific. GA-based tuning requires running many time-consuming experiments for each input size. In this paper, we present a strategy to automatically generate an empirical model for a given application pattern. Using a set of real measurements from running an I/O kernel as a training set, we generate a nonlinear regression model. We use this model to predict the top-20 tunable parameter values that give efficient I/O performance and rerun the I/O kernel to select the best set of parameters under the current conditions as tunable parameters for future runs of the same I/O kernel. Using this approach, we demonstrate 6X-94X speedup over the default I/O time for different I/O kernels running on multiple HPC systems. We also evaluate performance by identifying interdependencies among different sets of tunable parameters.","PeriodicalId":187042,"journal":{"name":"2015 IEEE International Conference on Cluster Computing","volume":"66 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114213226","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
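The tuning loop described in the abstract above (fit an empirical model on a few measured runs, predict the best candidate parameter sets, re-run only those) can be sketched in miniature. This is a hedged illustration, not the paper's method: a simple log-linear least-squares fit stands in for their nonlinear regression, the `stripe_count` parameter and all "measured" times are invented, and only a top-3 shortlist is kept rather than the paper's top-20.

```python
import math

def fit_linear(xs, ys):
    """Ordinary least squares for y = a + b*x (closed form)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b

# Synthetic training runs: (hypothetical stripe_count, measured I/O time in s).
training = [(1, 40.0), (4, 21.0), (16, 11.0), (64, 6.5)]
a, b = fit_linear([math.log2(s) for s, _ in training],
                  [math.log2(t) for _, t in training])

def predict(stripe_count):
    """Predicted I/O time under the fitted log-linear surrogate model."""
    return 2 ** (a + b * math.log2(stripe_count))

# Rank every candidate setting by predicted time; keep a shortlist to re-run
# for real and pick the winner under current system conditions.
candidates = [2 ** i for i in range(8)]  # stripe counts 1 .. 128
top_k = sorted(candidates, key=predict)[:3]
print(top_k)
```

The point of the shortlist is that only a handful of configurations are ever re-measured, which is what makes the approach cheaper than exhaustive or GA-based search.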
{"title":"MPC: A Massively Parallel Compression Algorithm for Scientific Data","authors":"Annie Yang, Hari Mukka, Farbod Hesaaraki, Martin Burtscher","doi":"10.1109/CLUSTER.2015.59","DOIUrl":"https://doi.org/10.1109/CLUSTER.2015.59","url":null,"abstract":"Due to their high peak performance and energy efficiency, massively parallel accelerators such as GPUs are quickly spreading in high-performance computing, where large amounts of floating-point data are processed, transferred, and stored. Such environments can greatly benefit from data compression if done sufficiently quickly. Unfortunately, most conventional compression algorithms are unsuitable for highly parallel execution. In fact, it is generally unknown how to design good compression algorithms for massively parallel systems. To remedy this situation, we study 138,240 lossless compression algorithms for single- and double-precision floating-point values that are built exclusively from easily parallelizable components. We analyze the best of these algorithms, explain why they compress well, and derive the Massively Parallel Compression (MPC) algorithm from them. This novel algorithm requires almost no internal state, achieves heretofore unreached compression ratios on several data sets, and roughly matches the best CPU-based algorithms in compression ratio while outperforming them by one to two orders of magnitude in throughput.","PeriodicalId":187042,"journal":{"name":"2015 IEEE International Conference on Cluster Computing","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115811548","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
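A toy illustration (not the actual MPC algorithm) of the kind of "easily parallelizable component" the abstract above refers to: reinterpret doubles as 64-bit integers and delta-encode them. On smooth scientific data the deltas cluster near zero, which helps a later coding stage, and both the encode (a shifted subtract) and the decode (a prefix sum) map naturally onto data-parallel GPU scans.

```python
import struct

def doubles_to_bits(values):
    """Reinterpret each IEEE-754 double as an unsigned 64-bit integer."""
    return [struct.unpack("<Q", struct.pack("<d", v))[0] for v in values]

def bits_to_doubles(bits):
    """Inverse reinterpretation: 64-bit integers back to doubles."""
    return [struct.unpack("<d", struct.pack("<Q", b))[0] for b in bits]

def delta_encode(bits):
    # Subtract each predecessor modulo 2**64; trivially parallel as a
    # shifted element-wise subtraction.
    return [bits[0]] + [(bits[i] - bits[i - 1]) % 2 ** 64
                        for i in range(1, len(bits))]

def delta_decode(deltas):
    # Inverse is a running (prefix) sum modulo 2**64 - a classic parallel scan.
    out = [deltas[0]]
    for d in deltas[1:]:
        out.append((out[-1] + d) % 2 ** 64)
    return out

data = [1.0, 1.0000001, 1.0000002, 1.0000004]
restored = bits_to_doubles(delta_decode(delta_encode(doubles_to_bits(data))))
assert restored == data  # lossless round trip
```

Because every stage is invertible and stateless across elements, chains of such components can be searched mechanically, which is in the spirit of the 138,240-algorithm design-space study the paper describes.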
{"title":"SideWalk: A Facility of Lightweight Out-of-Band Communications for Augmenting Distributed Data Processing Flows","authors":"Yin Huai, Yuan Yuan, Rubao Lee, Xiaodong Zhang","doi":"10.1109/CLUSTER.2015.43","DOIUrl":"https://doi.org/10.1109/CLUSTER.2015.43","url":null,"abstract":"The foundation of a data processing engine running on a large cluster is its programming model, which defines data processing operations and data movements. A special kind of communication activity that is not normally defined in the programming model but is often used in ad hoc ways during system development is called out-of-band communication. Existing ad hoc solutions for out-of-band communications are often hard to reuse, error-prone, and not free from unwanted side effects. To address these issues, we have designed and implemented a standalone facility for out-of-band communications called SideWalk. With this facility, users can add out-of-band communication operations to their distributed data flows through a set of reusable APIs. These APIs have well-defined semantics, and thus users' chances of writing error-prone programs with SideWalk are minimized. To prevent users from introducing unwanted side effects while using SideWalk, we prototype SideWalk to efficiently handle lightweight out-of-band communications, and we restrict the communication patterns that can be conducted through SideWalk without affecting its applicability to typical use cases. Our experimental results show that execution times of distributed data processing flows in a Hadoop environment, with out-of-band communications implemented using SideWalk, are reduced by up to 1.53x compared with those of flows whose out-of-band communications are implemented with a representative ad hoc solution.","PeriodicalId":187042,"journal":{"name":"2015 IEEE International Conference on Cluster Computing","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127850355","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}