Proceedings of the Platform for Advanced Scientific Computing Conference: Latest Articles (Page 10)

Evaluating the Arm Ecosystem for High Performance Computing

Proceedings of the Platform for Advanced Scientific Computing Conference Pub Date : 2019-04-08 DOI: 10.1145/3324989.3325722

A. Jackson, Andrew Turner, M. Weiland, N. Johnson, O. Perks, Mark I. Parsons

Citations: 16

A Discontinuous Galerkin Fast Spectral Method for Multi-Species Full Boltzmann on Streaming Multi-Processors

Proceedings of the Platform for Advanced Scientific Computing Conference Pub Date : 2019-03-12 DOI: 10.1145/3324989.3325714

S. Jaiswal, Jingwei Hu, J. Brillon, Alina A. Alexeenko

Abstract: When the molecules of a gaseous system are far apart, as in microscale gas flows where the surface-to-volume ratio is high and surface forces therefore dominate, molecule-surface interactions lead to the formation of a local thermodynamic non-equilibrium region extending a few mean free paths from the surface. The dynamics of such systems are accurately described by the Boltzmann equation. However, the multi-dimensional nature of the Boltzmann equation presents a huge computational challenge. With recent mathematical developments and the advent of petascale computing, the dynamics of the full Boltzmann equation are now tractable. We present an implementation of the recently introduced multi-species discontinuous Galerkin fast spectral (DGFS) method for solving the full Boltzmann equation on streaming multi-processors. The present implementation solves the inhomogeneous Boltzmann equation in a span of a few minutes, making it at least two orders of magnitude faster than the present state-of-the-art stochastic method, direct simulation Monte Carlo, which is widely used for solving the Boltzmann equation. Various performance metrics, such as weak and strong scaling, are presented. A parallel efficiency of 0.96 to 0.99 is demonstrated on 36 Nvidia Tesla P100 GPUs.
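The parallel-efficiency figure quoted in the abstract above is a ratio of measured to ideal run time. The following is a minimal sketch of the two usual definitions (strong and weak scaling); the function names and the timings are hypothetical and are not taken from the paper.

```python
def strong_scaling_efficiency(t_ref, n_ref, t_n, n):
    """Strong scaling: the total problem size is fixed.

    Ideal run time on n devices is t_ref * n_ref / n, so the efficiency
    is (t_ref * n_ref) / (t_n * n).
    """
    return (t_ref * n_ref) / (t_n * n)


def weak_scaling_efficiency(t_ref, t_n):
    """Weak scaling: the problem size per device is fixed.

    Ideal run time stays constant, so the efficiency is t_ref / t_n.
    """
    return t_ref / t_n


# Hypothetical timings in seconds, chosen only for illustration.
strong = strong_scaling_efficiency(t_ref=360.0, n_ref=1, t_n=10.4, n=36)
weak = weak_scaling_efficiency(t_ref=100.0, t_n=102.0)
print(f"strong: {strong:.3f}, weak: {weak:.3f}")
```

An efficiency of 1.0 means perfect scaling; values just below 1.0, as reported in the abstract, indicate that little time is lost to communication or load imbalance.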

Citations: 4

Abstractions and Directives for Adapting Wavefront Algorithms to Future Architectures

Proceedings of the Platform for Advanced Scientific Computing Conference Pub Date : 2018-07-02 DOI: 10.1145/3218176.3218228

Robert Searles, S. Chandrasekaran, W. Joubert, Oscar R. Hernandez

Abstract: Architectures are rapidly evolving, and exascale machines are expected to offer billion-way concurrency. We need to rethink algorithms, languages, and programming models, among other components, in order to migrate large-scale applications and explore parallelism on these machines. Although directive-based programming models allow programmers to worry less about programming and more about science, expressing complex parallel patterns in these models can be a daunting task, especially when the goal is to match the performance that the hardware platforms can offer. One such pattern is the wavefront. This paper extensively studies a wavefront-based miniapplication for Denovo, a production code for nuclear reactor modeling. We parallelize the Koch-Baker-Alcouffe (KBA) parallel-wavefront sweep algorithm in the main kernel of Minisweep (the miniapplication) using CUDA, OpenMP, and OpenACC. Our OpenACC implementation running on NVIDIA's next-generation Volta GPU achieves an 85.06x speedup over serial code, which is larger than CUDA's 83.72x speedup over the same serial implementation. Our experimental platform includes SummitDev, an ORNL system representative of the architecture of the upcoming Summit supercomputer. Our parallelization effort across platforms also motivated us to define an abstract parallelism model that is architecture independent, with the goal of creating software abstractions that can be used by applications employing the wavefront sweep motif.
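The wavefront pattern named in the abstract above can be illustrated on a 2D grid in which each cell depends on its upper and left neighbors. This is a simplified sketch, not the KBA algorithm or Minisweep itself; the update rule and the names `wavefronts` and `sweep` are assumptions made for illustration. Cells on the same anti-diagonal have no mutual dependencies, so each front could be processed in parallel.

```python
def wavefronts(rows, cols):
    """Yield the cells of a rows x cols grid grouped by anti-diagonal i + j = d."""
    for d in range(rows + cols - 1):
        lo = max(0, d - cols + 1)
        hi = min(rows - 1, d)
        yield [(i, d - i) for i in range(lo, hi + 1)]


def sweep(source):
    """Sweep from the top-left corner; each cell reads its upper and left neighbors."""
    rows, cols = len(source), len(source[0])
    out = [[0.0] * cols for _ in range(rows)]
    for front in wavefronts(rows, cols):
        # Every cell in `front` depends only on cells from earlier fronts,
        # so this inner loop is the part a parallel implementation would distribute.
        for i, j in front:
            up = out[i - 1][j] if i > 0 else 0.0
            left = out[i][j - 1] if j > 0 else 0.0
            # Hypothetical update rule, chosen only to create the dependency.
            out[i][j] = source[i][j] + 0.5 * (up + left)
    return out


print(list(wavefronts(2, 3)))
print(sweep([[1.0, 1.0], [1.0, 1.0]]))
```

The amount of available parallelism grows and then shrinks as the front crosses the grid, which is one reason the pattern is hard to express efficiently with directives alone.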

Citations: 8

Distributed, Shared-Memory Parallel Triangle Counting

Proceedings of the Platform for Advanced Scientific Computing Conference Pub Date : 2018-07-02 DOI: 10.1145/3218176.3218229

Thejaka Amila Kanewala, Marcin Zalewski, A. Lumsdaine

Abstract: Triangles are the most basic non-trivial subgraphs. Triangle counting is used in a number of different applications, including social network mining, cyber security, and spam detection. In general, triangle counting algorithms are readily parallelizable, but when implemented in distributed, shared-memory settings, their performance is poor due to high communication, imbalance of work, and the difficulty of exploiting the locality available in shared memory. In this paper, we discuss four different (but related) triangle counting algorithms and how their performance can be improved in distributed, shared-memory settings by reducing in-node load imbalance, improving cache utilization, minimizing network overhead, and minimizing algorithmic work. We generalize the four triangle counting algorithms into a common framework and show that, for all four algorithms, the in-node load imbalance can be minimized while utilizing caches by partitioning work into blocks of vertices, the network overhead can be minimized by aggregating blocks of work, and algorithmic work can be reduced by partitioning vertex neighbors by degree. We experimentally evaluate the weak and strong scaling performance of the proposed algorithms with two types of synthetic graph inputs and three real-world graph inputs. We also compare the performance of our implementations with the distributed, shared-memory triangle counting algorithms available in PowerGraph-GraphLab and show that our proposed algorithms outperform those algorithms in terms of both space and time.
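For orientation, a serial triangle count with degree-based ordering can be sketched as follows. This is a common textbook technique and is only an assumption about the flavor of algorithm discussed; it is not one of the paper's four distributed algorithms, and the function name is hypothetical. Each edge is directed from the lower-ranked to the higher-ranked endpoint, so every triangle is found exactly once.

```python
def count_triangles(edges):
    """Count triangles in an undirected simple graph given as (u, v) pairs."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    # Rank vertices by (degree, id); ties are broken by the vertex id.
    rank = {v: (len(nbrs), v) for v, nbrs in adj.items()}

    # Keep only neighbors of higher rank: this bounds the work done at
    # high-degree vertices and avoids counting a triangle more than once.
    higher = {v: {w for w in nbrs if rank[w] > rank[v]} for v, nbrs in adj.items()}

    return sum(len(higher[u] & higher[v]) for u in higher for v in higher[u])


k4 = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
print(count_triangles(k4))
```

Vertex ids are assumed to be mutually comparable (for example, integers) so that rank ties can be broken.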

Citations: 3

MRG8

Proceedings of the Platform for Advanced Scientific Computing Conference Pub Date : 2018-07-02 DOI: 10.1145/3218176.3218230

Yusuke Nagasaka, Akira Nukada, Satoshi Matsuoka, K. Miura, J. Shalf

Abstract: Pseudo random number generators (PRNGs) are crucial for numerous applications in HPC, ranging from molecular dynamics to quantum chemistry and hydrodynamics. These applications require high throughput and good statistical quality from the PRNGs, especially for parallel computing, where long pseudo-random sequences can be exhausted rapidly. Although a handful of PRNGs have been adapted to parallel computing, they do not fully exploit the features of wide-SIMD many-core processors and GPU accelerators in modern supercomputers. Multiple Recursive Generators (MRGs) are a family of random number generators based on higher-order recursion, which provide statistically high-quality random number sequences with extremely long recurrence lengths and deterministic jump-ahead for effective parallelism. We reformulate the MRG8 (8th-order recursive implementation) for Intel's KNL and NVIDIA's P100 GPU, named MRG8-AVX512 and MRG8-GPU respectively. Our optimized implementation generates the same random number sequence as the original well-characterized MRG8. We evaluated MRG8-AVX512 and MRG8-GPU together with vendor-tuned random number generators for Intel KNL and GPU. MRG8-AVX512 achieves a substantial 69% improvement compared to Intel's MKL, and MRG8-GPU shows a maximum 3.36x speedup compared to NVIDIA's cuRAND library.
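The higher-order recursion and the deterministic jump-ahead mentioned in the abstract above can be sketched for a generic order-k MRG, x_n = (a_1 x_{n-1} + ... + a_k x_{n-k}) mod m. The coefficients and seed below are hypothetical placeholders with no claim to statistical quality, and the modulus 2^31 - 1 is an assumption of this sketch; none of them are the published MRG8 parameters. Jump-ahead is done by raising the companion matrix of the recurrence to a power.

```python
M = 2**31 - 1  # modulus assumed for this sketch


def mrg_step(state, coeffs, m=M):
    """Advance one step; state is [x_{n-1}, ..., x_{n-k}]."""
    x_new = sum(a * x for a, x in zip(coeffs, state)) % m
    return [x_new] + state[:-1]


def companion(coeffs):
    """Companion matrix A such that new_state = A @ state (mod m)."""
    k = len(coeffs)
    rows = [list(coeffs)]
    for i in range(k - 1):
        rows.append([1 if j == i else 0 for j in range(k)])
    return rows


def mat_mul(a, b, m=M):
    n = len(a)
    return [[sum(a[i][t] * b[t][j] for t in range(n)) % m for j in range(n)]
            for i in range(n)]


def mat_pow(a, e, m=M):
    """Square-and-multiply matrix power modulo m."""
    n = len(a)
    result = [[int(i == j) for j in range(n)] for i in range(n)]
    while e > 0:
        if e & 1:
            result = mat_mul(result, a, m)
        a = mat_mul(a, a, m)
        e >>= 1
    return result


def jump_ahead(state, coeffs, steps, m=M):
    """Return the state `steps` iterations ahead using O(log steps) matrix products."""
    p = mat_pow(companion(coeffs), steps, m)
    k = len(state)
    return [sum(p[i][j] * state[j] for j in range(k)) % m for i in range(k)]


COEFFS = [3, 5, 7, 11, 13, 17, 19, 23]  # hypothetical, order 8
SEED = [1, 2, 3, 4, 5, 6, 7, 8]         # hypothetical initial state

stepped = SEED
for _ in range(1000):
    stepped = mrg_step(stepped, COEFFS)
print(stepped == jump_ahead(SEED, COEFFS, 1000))
```

With jump-ahead, each thread or SIMD lane can start from a state far along the same sequence, which is how a single stream can be split into non-overlapping substreams.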

Citations: 0

Proceedings of the Platform for Advanced Scientific Computing Conference

Proceedings of the Platform for Advanced Scientific Computing Conference Pub Date : 2018-07-02 DOI: 10.1145/3218176

Citations: 2

Balanced Graph Partition Refinement using the Graph p-Laplacian

Proceedings of the Platform for Advanced Scientific Computing Conference Pub Date : 2018-07-02 DOI: 10.1145/3218176.3218232

Toby Simpson, D. Pasadakis, D. Kourounis, K. Fujita, Takuma Yamaguchi, T. Ichimura, O. Schenk

Abstract: A continuous formulation of optimal 2-way graph partitioning based on the p-norm minimization of the graph Laplacian Rayleigh quotient is presented, which provides a sharp approximation to the balanced graph partitioning problem, whose optimal solution is known to be NP-hard to find. The minimization is initialized from a cut provided by a state-of-the-art multilevel recursive bisection algorithm, and then a continuation approach reduces the p-norm from a 2-norm towards a 1-norm, employing for each value of p a feasibility-preserving steepest-descent method that converges on the p-Laplacian eigenvector. A filter favors iterates advancing towards minimum edge cut and partition load imbalance. The complexity of the suggested approach is linear in the number of graph edges. The simplicity of the steepest-descent algorithm renders the overall approach highly scalable and efficient on parallel distributed architectures. Parallel implementations of recursive bisection on multi-core CPUs and GPUs are presented for large-scale graphs with up to 1.9 billion tetrahedra. The suggested approach exhibits improvements of up to 52.8% over METIS for graphs originating from triangular Delaunay meshes, 34.7% over METIS and 21.9% over KaHIP for power network graphs, 40.8% over METIS and 20.6% over KaHIP for sparse matrix graphs, and finally 93.2% over METIS for graphs emerging from social networks.
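The quantity minimized in the abstract above can be illustrated with the commonly used form of the graph p-Laplacian Rayleigh quotient, R_p(u) = sum over edges of w_ij |u_i - u_j|^p divided by sum over vertices of |u_i|^p. This form is an assumption; the paper may use additional constraints or a different normalization, and the function name is hypothetical. The sketch only evaluates the quotient for a fixed vector and does not implement the steepest-descent method.

```python
def p_rayleigh_quotient(edges, u, p):
    """Evaluate R_p(u) for weighted edges given as (i, j, w) triples."""
    num = sum(w * abs(u[i] - u[j]) ** p for i, j, w in edges)
    den = sum(abs(x) ** p for x in u)
    return num / den


# Path graph 0-1-2-3 with unit weights, split in the middle.
edges = [(0, 1, 1.0), (1, 2, 1.0), (2, 3, 1.0)]
u = [1, 1, -1, -1]  # +1 / -1 indicator of the two parts

# Continuation from p = 2 towards p = 1, evaluated on the same vector.
for p in (2, 1.5, 1):
    print(p, p_rayleigh_quotient(edges, u, p))
```

For a +1/-1 indicator vector at p = 1, the quotient equals twice the cut weight divided by the number of vertices, which is why lowering p towards 1 ties the continuous problem more closely to the discrete cut.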

Citations: 6

The CLAW DSL: Abstractions for Performance Portable Weather and Climate Models

Proceedings of the Platform for Advanced Scientific Computing Conference Pub Date : 2018-07-02 DOI: 10.1145/3218176.3218226

Valentin Clement, S. Ferrachat, O. Fuhrer, X. Lapillonne, C. Osuna, R. Pincus, Jonathan S. Rood, W. Sawyer

Abstract: In order to profit from emerging high-performance computing systems, weather and climate models need to be adapted to run efficiently on different hardware architectures such as accelerators. This is a major challenge for existing community models, which are very large code bases written in Fortran. We introduce the CLAW domain-specific language (CLAW DSL) and the CLAW Compiler, which allow a single code written in Fortran to be retained while achieving a high degree of performance portability. Specifically, we present the Single Column Abstraction (SCA) of the CLAW DSL, which targets the column-based algorithmic motifs typically encountered in the physical parameterizations of weather and climate models. Starting from serial, non-optimized source code, the CLAW Compiler applies transformations and optimizations for a specific target hardware architecture and generates parallel optimized Fortran code annotated with OpenMP or OpenACC directives. Results from a state-of-the-art radiative transfer code indicate that, using CLAW, the amount of source code can be significantly reduced while achieving efficient code for x86 multi-core CPUs and GPU accelerators. The CLAW DSL is a significant step towards performance portable climate and weather models and could be adopted incrementally in existing code with limited effort.

Citations: 30

Extreme Computing for Extreme Adaptive Optics: The Key to Finding Life Outside our Solar System

Proceedings of the Platform for Advanced Scientific Computing Conference Pub Date : 2018-07-02 DOI: 10.1145/3218176.3218225

H. Ltaief, D. Sukkari, O. Guyon, D. Keyes

Abstract: The real-time correction of telescopic images in the search for exoplanets is highly sensitive to atmospheric aberrations. The pseudo-inverse algorithm is an efficient mathematical method to filter out this turbulence. We introduce a new partial singular value decomposition (SVD) algorithm based on the QR-based Diagonally Weighted Halley (QDWH) iteration for the pseudo-inverse method of adaptive optics. The QDWH partial SVD algorithm selectively calculates the most significant singular values and their corresponding singular vectors. We develop a high-performance implementation and demonstrate the numerical robustness of the QDWH-based partial SVD method. We also perform a benchmarking campaign on various generations of GPU hardware accelerators and compare against the state-of-the-art SVD implementation SGESDD from the MAGMA library. Numerical accuracy and performance results are reported using synthetic and real observational datasets from the Subaru telescope. Our implementation outperforms SGESDD by up to fivefold and fourfold on ill-conditioned synthetic matrices and real observational datasets, respectively. The pseudo-inverse simulation code will be deployed on-sky for the Subaru telescope during observation nights scheduled for early 2018.
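The idea of building a pseudo-inverse from only the most significant singular values can be sketched as A_k^+ = sum over the k largest singular triplets of (1 / sigma_i) v_i u_i^T. The sketch below finds the triplets with plain power iteration and deflation, which is a deliberately simple stand-in and not the QDWH iteration of the paper; all function names are hypothetical.

```python
import math


def matvec(a, x):
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def norm(x):
    return math.sqrt(sum(v * v for v in x))


def top_singular_triplets(a, k, iters=200):
    """Return the k largest (sigma, u, v) via power iteration on A^T A.

    Assumes k does not exceed the rank of A; otherwise a division by zero occurs.
    """
    at = transpose(a)
    n = len(a[0])
    triplets = []
    for t in range(k):
        v = [1.0 / (j + 1 + t) for j in range(n)]  # deterministic start vector
        for _ in range(iters):
            # Deflation: remove components along singular vectors already found.
            for _, _, vp in triplets:
                dot = sum(x * y for x, y in zip(v, vp))
                v = [x - dot * y for x, y in zip(v, vp)]
            w = matvec(at, matvec(a, v))
            nw = norm(w)
            v = [x / nw for x in w]
        av = matvec(a, v)
        sigma = norm(av)
        u = [x / sigma for x in av]
        triplets.append((sigma, u, v))
    return triplets


def truncated_pinv(a, k):
    """Rank-k pseudo-inverse of an m x n matrix, returned as an n x m list of lists."""
    m, n = len(a), len(a[0])
    pinv = [[0.0] * m for _ in range(n)]
    for sigma, u, v in top_singular_triplets(a, k):
        for i in range(n):
            for j in range(m):
                pinv[i][j] += v[i] * u[j] / sigma
    return pinv


A = [[3.0, 0.0], [0.0, 1.0]]
print(truncated_pinv(A, 1))
```

Dropping the small singular values keeps their reciprocals, which would otherwise amplify noise, out of the pseudo-inverse; this is the usual motivation for truncation on ill-conditioned matrices.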

Citations: 7

Asynchronous Task-Based Parallelization of Algebraic Multigrid

Proceedings of the Platform for Advanced Scientific Computing Conference Pub Date : 2017-06-26 DOI: 10.1145/3093172.3093230

Amani Alonazi, George S. Markomanolis, D. Keyes

Abstract: As processor clock rates become more dynamic and workloads become more adaptive, the vulnerability to global synchronization that already complicates programming for performance in today's petascale environment will be exacerbated. Algebraic multigrid (AMG), the solver of choice in many large-scale PDE-based simulations, scales well in the weak sense, with fixed problem size per node, on tightly coupled systems when loads are well balanced and core performance is reliable. However, its strong scaling to many cores within a node is challenging. Reducing synchronization and increasing concurrency are vital adaptations of AMG to hybrid architectures. Recent communication-reducing improvements to classical additive AMG by Vassilevski and Yang improve concurrency and increase communication-computation overlap, while retaining convergence properties close to those of standard multiplicative AMG, but remain bulk synchronous. We extend the Vassilevski and Yang additive AMG to asynchronous task-based parallelism using hybrid MPI+OmpSs (from the Barcelona Supercomputing Center) within a node, along with MPI for internode communications. We implement a tiling approach to decompose the grid hierarchy into parallel units within task containers. We compare against the MPI-only BoomerAMG and the Auxiliary-space Maxwell Solver (AMS) in the hypre library for the 3D Laplacian operator and electromagnetic diffusion, respectively. In time to solution for a full solve, an MPI-OmpSs hybrid improves over an all-MPI approach in strong scaling at full core count (32 threads per single Haswell node of the Cray XC40) and maintains this per-node advantage as both approaches weak-scale to thousands of cores, with MPI between nodes.

Citations: 7