A. Jackson, Andrew Turner, M. Weiland, N. Johnson, O. Perks, Mark I. Parsons
{"title":"Evaluating the Arm Ecosystem for High Performance Computing","authors":"A. Jackson, Andrew Turner, M. Weiland, N. Johnson, O. Perks, Mark I. Parsons","doi":"10.1145/3324989.3325722","DOIUrl":"https://doi.org/10.1145/3324989.3325722","url":null,"abstract":"In recent years, Arm-based processors have arrived on the HPC scene, offering an alternative the existing status quo, which was largely dominated by x86 processors. In this paper, we evaluate the Arm ecosystem, both the hardware offering and the software stack that is available to users, by benchmarking a production HPC platform that uses Marvell's ThunderX2 processors. We investigate the performance of complex scientific applications across multiple nodes, and we also assess the maturity of the software stack and the ease of use from a users' perspective. This papers finds that the performance across our benchmarking applications is generally as good as, or better, than that of well-established platforms, and we can conclude from our experience that there are no major hurdles that might hinder wider adoption of this ecosystem within the HPC community.","PeriodicalId":174137,"journal":{"name":"Proceedings of the Platform for Advanced Scientific Computing Conference","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131914207","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
S. Jaiswal, Jingwei Hu, J. Brillon, Alina A. Alexeenko
{"title":"A Discontinuous Galerkin Fast Spectral Method for Multi-Species Full Boltzmann on Streaming Multi-Processors","authors":"S. Jaiswal, Jingwei Hu, J. Brillon, Alina A. Alexeenko","doi":"10.1145/3324989.3325714","DOIUrl":"https://doi.org/10.1145/3324989.3325714","url":null,"abstract":"When the molecules of a gaseous system are far apart, say in microscale gas flows where the surface to volume ratio is high and hence the surface forces dominant, the molecule-surface interactions lead to the formation of a local thermodynamically non-equilibrium region extending few mean free paths from the surface. The dynamics of such systems is accurately described by Boltzmann equation. However, the multi-dimensional nature of Boltzmann equation presents a huge computational challenge. With the recent mathematical developments and the advent of petascale, the dynamics of full Boltzmann equation is now tractable. We present an implementation of the recently introduced multi-species discontinuous Galerkin fast spectral (DGFS) method for solving full Boltzmann on streaming multi-processors. The present implementation solves the inhomogeneous Boltzmann equation in span of few minutes, making it at least two order-of-magnitude faster than the present state-of-art stochastic method---direct simulation Monte Carlo---widely used for solving Boltzmann equation. Various performance metrics, such as weak/strong scaling have been presented. A parallel efficiency of 0.96--0.99 is demonstrated on 36 Nvidia Tesla-P100 GPUs.","PeriodicalId":174137,"journal":{"name":"Proceedings of the Platform for Advanced Scientific Computing Conference","volume":"78 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127167395","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Robert Searles, S. Chandrasekaran, W. Joubert, Oscar R. Hernandez
{"title":"Abstractions and Directives for Adapting Wavefront Algorithms to Future Architectures","authors":"Robert Searles, S. Chandrasekaran, W. Joubert, Oscar R. Hernandez","doi":"10.1145/3218176.3218228","DOIUrl":"https://doi.org/10.1145/3218176.3218228","url":null,"abstract":"Architectures are rapidly evolving, and exascale machines are expected to offer billion-way concurrency. We need to rethink algorithms, languages and programming models among other components in order to migrate large scale applications and explore parallelism on these machines. Although directive-based programming models allow programmers to worry less about programming and more about science, expressing complex parallel patterns in these models can be a daunting task especially when the goal is to match the performance that the hardware platforms can offer. One such pattern is wavefront. This paper extensively studies a wavefront-based miniapplication for Denovo, a production code for nuclear reactor modeling. We parallelize the Koch-Baker-Alcouffe (KBA) parallel-wavefront sweep algorithm in the main kernel of Minisweep (the miniapplication) using CUDA, OpenMP and OpenACC. Our OpenACC implementation running on NVIDIA's next-generation Volta GPU boasts an 85.06x speedup over serial code, which is larger than CUDA's 83.72x speedup over the same serial implementation. Our experimental platform includes SummitDev, an ORNL representative architecture of the upcoming Summit supercomputer. Our parallelization effort across platforms also motivated us to define an abstract parallelism model that is architecture independent, with a goal of creating software abstractions that can be used by applications employing the wavefront sweep motif.","PeriodicalId":174137,"journal":{"name":"Proceedings of the Platform for Advanced Scientific Computing Conference","volume":"79 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126733535","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Thejaka Amila Kanewala, Marcin Zalewski, A. Lumsdaine
{"title":"Distributed, Shared-Memory Parallel Triangle Counting","authors":"Thejaka Amila Kanewala, Marcin Zalewski, A. Lumsdaine","doi":"10.1145/3218176.3218229","DOIUrl":"https://doi.org/10.1145/3218176.3218229","url":null,"abstract":"Triangles are the most basic non-trivial subgraphs. Triangle counting is used in a number of different applications, including social network mining, cyber security, and spam detection. In general, triangle counting algorithms are readily parallelizable, but when implemented in distributed, shared-memory, their performance is poor due to high communication, imbalance of work, and the difficulty of exploiting locality available in shared memory. In this paper, we discuss four different (but related) triangle counting algorithms and how their performance can be improved in distributed, shared-memory by reducing in-node load imbalance, improving cache utilization, minimizing network overhead, and minimizing algorithmic work. We generalize the four different triangle counting algorithms into a common framework and show that for all four algorithms the in-node load imbalance can be minimized while utilizing caches by partitioning work into blocks of vertices, the network overhead can be minimized by aggregation of blocks of work, and algorithm work can be reduced by partitioning vertex neighbors by degree. We experimentally evaluate the weak and the strong scaling performance of the proposed algorithms with two types of synthetic graph inputs and three real-world graph inputs. We also compare the performance of our implementations with the distributed, shared-memory triangle counting algorithms available in PowerGraph-GraphLab and show that our proposed algorithms outperform those algorithms, both in terms of space and time.","PeriodicalId":174137,"journal":{"name":"Proceedings of the Platform for Advanced Scientific Computing Conference","volume":"69 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115525014","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yusuke Nagasaka, Akira Nukada, Satoshi Matsuoka, K. Miura, J. Shalf
{"title":"MRG8","authors":"Yusuke Nagasaka, Akira Nukada, Satoshi Matsuoka, K. Miura, J. Shalf","doi":"10.1145/3218176.3218230","DOIUrl":"https://doi.org/10.1145/3218176.3218230","url":null,"abstract":"Pseudo random number generators (PRNGs) are crucial for numerous applications in HPC ranging from molecular dynamics to quantum chemistry, and hydrodynamics. These applications require high throughput and good statistical quality from the PRNGs - especially for parallel computing where long pseudo-random sequences can be exhausted rapidly. Although a handful PRNGs have been adapted to parallel computing, they do not fully exploit the features of wide-SIMD many-core processors and GPU accelerators in modern supercomputers. Multiple Recursive Generators (MRGs) are a family of random number generators based on higher order recursion, which provide statistically high-quality random number sequences with extremely long-recurrence lengths, and deterministic jump-ahead for effective parallelism. We reformulate the MRG8 (8th-order recursive implementation) for Intel's KNL and NVIDIA's P100 GPU - named MRG8-AVX512 and MRG8-GPU respectively. Our optimized implementation generates the same random number sequence as the original well-characterized MRG8. We evaluated MRG8-AVX512 and MRG8-GPU together with vender tuned random number generators for Intel KNL and GPU. MRG8-AVX512 achieves a substantial 69% improvement compared to Intel's MKL, and MRG8-GPU shows a maximum 3.36x speedup compared to NVIDIA's cuRAND library.","PeriodicalId":174137,"journal":{"name":"Proceedings of the Platform for Advanced Scientific Computing Conference","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122554797","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Proceedings of the Platform for Advanced Scientific Computing Conference","authors":"","doi":"10.1145/3218176","DOIUrl":"https://doi.org/10.1145/3218176","url":null,"abstract":"","PeriodicalId":174137,"journal":{"name":"Proceedings of the Platform for Advanced Scientific Computing Conference","volume":"137 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122657666","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Toby Simpson, D. Pasadakis, D. Kourounis, K. Fujita, Takuma Yamaguchi, T. Ichimura, O. Schenk
{"title":"Balanced Graph Partition Refinement using the Graph p-Laplacian","authors":"Toby Simpson, D. Pasadakis, D. Kourounis, K. Fujita, Takuma Yamaguchi, T. Ichimura, O. Schenk","doi":"10.1145/3218176.3218232","DOIUrl":"https://doi.org/10.1145/3218176.3218232","url":null,"abstract":"A continuous formulation of the optimal 2-way graph partitioning based on the p-norm minimization of the graph Laplacian Rayleigh quotient is presented, which provides a sharp approximation to the balanced graph partitioning problem, the optimality of which is known to be NP-hard. The minimization is initialized from a cut provided by a state-of-the-art multilevel recursive bisection algorithm, and then a continuation approach reduces the p-norm from a 2-norm towards a 1-norm, employing for each value of p a feasibility-preserving steepest-descent method that converges on the p-Laplacian eigenvector. A filter favors iterates advancing towards minimum edgecut and partition load imbalance. The complexity of the suggested approach is linear in graph edges. The simplicity of the steepest-descent algorithm renders the overall approach highly scalable and efficient in parallel distributed architectures. Parallel implementation of recursive bisection on multi-core CPUs and GPUs are presented for large-scale graphs with up to 1.9 billion tetrahedra. The suggested approach exhibits improvements of up to 52.8% over METIS for graphs originating from triangular Delaunay meshes, 34.7% over METIS and 21.9% over KaHIP for power network graphs, 40.8% over METIS and 20.6% over KaHIP for sparse matrix graphs, and finally 93.2% over METIS for graphs emerging from social networks.","PeriodicalId":174137,"journal":{"name":"Proceedings of the Platform for Advanced Scientific Computing Conference","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131117744","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Valentin Clement, S. Ferrachat, O. Fuhrer, X. Lapillonne, C. Osuna, R. Pincus, Jonathan S. Rood, W. Sawyer
{"title":"The CLAW DSL: Abstractions for Performance Portable Weather and Climate Models","authors":"Valentin Clement, S. Ferrachat, O. Fuhrer, X. Lapillonne, C. Osuna, R. Pincus, Jonathan S. Rood, W. Sawyer","doi":"10.1145/3218176.3218226","DOIUrl":"https://doi.org/10.1145/3218176.3218226","url":null,"abstract":"In order to profit from emerging high-performance computing systems, weather and climate models need to be adapted to run efficiently on different hardware architectures such as accelerators. This is a major challenge for existing community models that represent very large code bases written in Fortran. We introduce the CLAW domain-specific language (CLAW DSL) and the CLAW Compiler that allows the retention of a single code written in Fortran and achieve a high degree of performance portability. Specifically, we present the Single Column Abstraction (SCA) of the CLAW DSL that is targeted at the column-based algorithmic motifs typically encountered in the physical parameterizations of weather and climate models. Starting from a serial and non-optimized source code, the CLAW Compiler applies transformations and optimizations for a specific target hardware architecture and generates parallel optimized Fortran code annotated with OpenMP or OpenACC directives. Results from a state-of-the-art radiative transfer code, indicate that using CLAW, the amount of source code can be significantly reduced while achieving efficient code for x86 multi-core CPUs and GPU accelerators. The CLAW DSL is a significant step towards performance portable climate and weather model and could be adopted incrementally in existing code with limited effort.","PeriodicalId":174137,"journal":{"name":"Proceedings of the Platform for Advanced Scientific Computing Conference","volume":"129 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127032435","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Extreme Computing for Extreme Adaptive Optics: The Key to Finding Life Outside our Solar System","authors":"H. Ltaief, D. Sukkari, O. Guyon, D. Keyes","doi":"10.1145/3218176.3218225","DOIUrl":"https://doi.org/10.1145/3218176.3218225","url":null,"abstract":"The real-time correction of telescopic images in the search for exoplanets is highly sensitive to atmospheric aberrations. The pseudo-inverse algorithm is an efficient mathematical method to filter out these turbulences. We introduce a new partial singular value decomposition (SVD) algorithm based on QR-based Diagonally Weighted Halley (QDWH) iteration for the pseudo-inverse method of adaptive optics. The QDWH partial SVD algorithm selectively calculates the most significant singular values and their corresponding singular vectors. We develop a high performance implementation and demonstrate the numerical robustness of the QDWH-based partial SVD method. We also perform a benchmarking campaign on various generations of GPU hardware accelerators and compare against the state-of-the-art SVD implementation SGESDD from the MAGMA library. Numerical accuracy and performance results are reported using synthetic and real observational datasets from the Subaru telescope. Our implementation outperforms SGESDD by up to fivefold and fourfold performance speedups on ill-conditioned synthetic matrices and real observational datasets, respectively. The pseudo-inverse simulation code will be deployed on-sky for the Subaru telescope during observation nights scheduled early 2018.","PeriodicalId":174137,"journal":{"name":"Proceedings of the Platform for Advanced Scientific Computing Conference","volume":"310 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116489003","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Asynchronous Task-Based Parallelization of Algebraic Multigrid","authors":"Amani Alonazi, George S. Markomanolis, D. Keyes","doi":"10.1145/3093172.3093230","DOIUrl":"https://doi.org/10.1145/3093172.3093230","url":null,"abstract":"As processor clock rates become more dynamic and workloads become more adaptive, the vulnerability to global synchronization that already complicates programming for performance in today's petascale environment will be exacerbated. Algebraic multigrid (AMG), the solver of choice in many large-scale PDE-based simulations, scales well in the weak sense, with fixed problem size per node, on tightly coupled systems when loads are well balanced and core performance is reliable. However, its strong scaling to many cores within a node is challenging. Reducing synchronization and increasing concurrency are vital adaptations of AMG to hybrid architectures. Recent communication-reducing improvements to classical additive AMG by Vassilevski and Yang improve concurrency and increase communication-computation overlap, while retaining convergence properties close to those of standard multiplicative AMG, but remain bulk synchronous. We extend the Vassilevski and Yang additive AMG to asynchronous task-based parallelism using a hybrid MPI+OmpSs (from the Barcelona Supercomputer Center) within a node, along with MPI for internode communications. We implement a tiling approach to decompose the grid hierarchy into parallel units within task containers. We compare against the MPI-only BoomerAMG and the Auxiliary-space Maxwell Solver (AMS) in the hypre library for the 3D Laplacian operator and the electromagnetic diffusion, respectively. In time to solution for a full solve an MPI-OmpSs hybrid improves over an all-MPI approach in strong scaling at full core count (32 threads per single Haswell node of the Cray XC40) and maintains this per node advantage as both weak scale to thousands of cores, with MPI between nodes.","PeriodicalId":174137,"journal":{"name":"Proceedings of the Platform for Advanced Scientific Computing Conference","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129933417","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}