Proceedings of the Platform for Advanced Scientific Computing Conference: Latest Articles

Evaluating the Arm Ecosystem for High Performance Computing
Proceedings of the Platform for Advanced Scientific Computing Conference Pub Date : 2019-04-08 DOI: 10.1145/3324989.3325722
A. Jackson, Andrew Turner, M. Weiland, N. Johnson, O. Perks, Mark I. Parsons
Abstract: In recent years, Arm-based processors have arrived on the HPC scene, offering an alternative to the existing status quo, which was largely dominated by x86 processors. In this paper, we evaluate the Arm ecosystem, both the hardware offering and the software stack that is available to users, by benchmarking a production HPC platform that uses Marvell's ThunderX2 processors. We investigate the performance of complex scientific applications across multiple nodes, and we also assess the maturity of the software stack and the ease of use from a user's perspective. This paper finds that the performance across our benchmarking applications is generally as good as, or better than, that of well-established platforms, and we conclude from our experience that there are no major hurdles that might hinder wider adoption of this ecosystem within the HPC community.
Citations: 16
A Discontinuous Galerkin Fast Spectral Method for Multi-Species Full Boltzmann on Streaming Multi-Processors
Proceedings of the Platform for Advanced Scientific Computing Conference Pub Date : 2019-03-12 DOI: 10.1145/3324989.3325714
S. Jaiswal, Jingwei Hu, J. Brillon, Alina A. Alexeenko
Abstract: When the molecules of a gaseous system are far apart, say in microscale gas flows where the surface-to-volume ratio is high and hence surface forces are dominant, the molecule-surface interactions lead to the formation of a local thermodynamically non-equilibrium region extending a few mean free paths from the surface. The dynamics of such systems are accurately described by the Boltzmann equation. However, the multi-dimensional nature of the Boltzmann equation presents a huge computational challenge. With recent mathematical developments and the advent of petascale computing, the dynamics of the full Boltzmann equation are now tractable. We present an implementation of the recently introduced multi-species discontinuous Galerkin fast spectral (DGFS) method for solving the full Boltzmann equation on streaming multi-processors. The present implementation solves the inhomogeneous Boltzmann equation in a span of a few minutes, making it at least two orders of magnitude faster than the present state-of-the-art stochastic method, direct simulation Monte Carlo (DSMC), which is widely used for solving the Boltzmann equation. Various performance metrics, such as weak and strong scaling, are presented. A parallel efficiency of 0.96 to 0.99 is demonstrated on 36 Nvidia Tesla P100 GPUs.
Citations: 4
Abstractions and Directives for Adapting Wavefront Algorithms to Future Architectures
Proceedings of the Platform for Advanced Scientific Computing Conference Pub Date : 2018-07-02 DOI: 10.1145/3218176.3218228
Robert Searles, S. Chandrasekaran, W. Joubert, Oscar R. Hernandez
Abstract: Architectures are rapidly evolving, and exascale machines are expected to offer billion-way concurrency. We need to rethink algorithms, languages, and programming models, among other components, in order to migrate large-scale applications and explore parallelism on these machines. Although directive-based programming models allow programmers to worry less about programming and more about science, expressing complex parallel patterns in these models can be a daunting task, especially when the goal is to match the performance that the hardware platforms can offer. One such pattern is wavefront. This paper extensively studies a wavefront-based miniapplication for Denovo, a production code for nuclear reactor modeling. We parallelize the Koch-Baker-Alcouffe (KBA) parallel-wavefront sweep algorithm in the main kernel of Minisweep (the miniapplication) using CUDA, OpenMP, and OpenACC. Our OpenACC implementation running on NVIDIA's next-generation Volta GPU achieves an 85.06x speedup over serial code, which is larger than CUDA's 83.72x speedup over the same serial implementation. Our experimental platform includes SummitDev, an ORNL representative architecture of the upcoming Summit supercomputer. Our parallelization effort across platforms also motivated us to define an abstract parallelism model that is architecture independent, with the goal of creating software abstractions that can be used by applications employing the wavefront sweep motif.
Citations: 8
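Illustrative note on the wavefront pattern named in the abstract above. The following is a minimal sketch, not code from Minisweep or Denovo: it assumes a simplified 2D grid in which cell (i, j) depends only on its upper and left neighbours, whereas the KBA sweep operates on a 3D, block-decomposed domain. The function names `wavefront_sweep` and `count_paths` are hypothetical. It shows why cells on one anti-diagonal can be processed concurrently.

```python
def wavefront_sweep(n_rows, n_cols, update):
    """Fill a grid in which cell (i, j) depends on (i-1, j) and (i, j-1)."""
    grid = [[0] * n_cols for _ in range(n_rows)]
    # Visit one anti-diagonal (i + j == d) per outer step.
    for d in range(n_rows + n_cols - 1):
        i_lo = max(0, d - n_cols + 1)
        i_hi = min(n_rows - 1, d)
        # Cells on the same anti-diagonal do not depend on each other, so
        # this inner loop is the one a parallel directive or GPU kernel
        # could distribute across threads.
        for i in range(i_lo, i_hi + 1):
            j = d - i
            up = grid[i - 1][j] if i > 0 else 0
            left = grid[i][j - 1] if j > 0 else 0
            grid[i][j] = update(up, left, i, j)
    return grid


def count_paths(up, left, i, j):
    """Toy update rule: number of monotone lattice paths reaching (i, j)."""
    return 1 if i == 0 and j == 0 else up + left


paths = wavefront_sweep(3, 3, count_paths)
```

The outer loop is inherently sequential, while the amount of parallel work per step grows and then shrinks with the length of the anti-diagonal, which is one reason the pattern is hard to express efficiently with simple loop directives.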
Distributed, Shared-Memory Parallel Triangle Counting
Proceedings of the Platform for Advanced Scientific Computing Conference Pub Date : 2018-07-02 DOI: 10.1145/3218176.3218229
Thejaka Amila Kanewala, Marcin Zalewski, A. Lumsdaine
Abstract: Triangles are the most basic non-trivial subgraphs. Triangle counting is used in a number of different applications, including social network mining, cyber security, and spam detection. In general, triangle counting algorithms are readily parallelizable, but when implemented in distributed, shared-memory environments, their performance is poor due to high communication, imbalance of work, and the difficulty of exploiting the locality available in shared memory. In this paper, we discuss four different (but related) triangle counting algorithms and how their performance can be improved in distributed, shared-memory environments by reducing in-node load imbalance, improving cache utilization, minimizing network overhead, and minimizing algorithmic work. We generalize the four triangle counting algorithms into a common framework and show that, for all four algorithms, the in-node load imbalance can be minimized while utilizing caches by partitioning work into blocks of vertices, the network overhead can be minimized by aggregating blocks of work, and the algorithmic work can be reduced by partitioning vertex neighbors by degree. We experimentally evaluate the weak and strong scaling performance of the proposed algorithms with two types of synthetic graph inputs and three real-world graph inputs. We also compare the performance of our implementations with the distributed, shared-memory triangle counting algorithms available in PowerGraph-GraphLab and show that our proposed algorithms outperform those algorithms in terms of both space and time.
Citations: 3
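Illustrative note on the degree-based work reduction mentioned in the abstract above. This is a minimal single-process sketch of a common degree-ordering technique, not the authors' distributed algorithms: each undirected edge is oriented from its lower-ranked endpoint to its higher-ranked one (rank being degree, with the vertex id as a tie-breaker), so every triangle is counted exactly once. The function name `count_triangles` and the edge-list input format are assumptions made for illustration.

```python
def count_triangles(edges):
    """Count triangles in an undirected graph given as (u, v) pairs."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    def rank(v):
        # Order vertices by degree, breaking ties by vertex id.
        return (len(adj[v]), v)

    # Keep only neighbours with a strictly higher rank.
    higher = {v: {w for w in nbrs if rank(w) > rank(v)}
              for v, nbrs in adj.items()}

    total = 0
    for v, outs in higher.items():
        for w in outs:
            # Common higher-ranked neighbours of v and w close a triangle
            # whose lowest-ranked vertex is v.
            total += len(outs & higher[w])
    return total


k4 = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
```

For the complete graph on four vertices, `count_triangles(k4)` returns 4. Orienting edges toward higher-degree vertices keeps the sets being intersected small for high-degree hubs, which is where most redundant work would otherwise occur.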
MRG8
Proceedings of the Platform for Advanced Scientific Computing Conference Pub Date : 2018-07-02 DOI: 10.1145/3218176.3218230
Yusuke Nagasaka, Akira Nukada, Satoshi Matsuoka, K. Miura, J. Shalf
Abstract: Pseudo-random number generators (PRNGs) are crucial for numerous applications in HPC, ranging from molecular dynamics to quantum chemistry and hydrodynamics. These applications require high throughput and good statistical quality from the PRNGs, especially for parallel computing, where long pseudo-random sequences can be exhausted rapidly. Although a handful of PRNGs have been adapted to parallel computing, they do not fully exploit the features of wide-SIMD many-core processors and GPU accelerators in modern supercomputers. Multiple Recursive Generators (MRGs) are a family of random number generators based on higher-order recursion, which provide statistically high-quality random number sequences with extremely long recurrence lengths and deterministic jump-ahead for effective parallelism. We reformulate MRG8 (an 8th-order recursive implementation) for Intel's KNL and NVIDIA's P100 GPU, named MRG8-AVX512 and MRG8-GPU respectively. Our optimized implementation generates the same random number sequence as the original, well-characterized MRG8. We evaluated MRG8-AVX512 and MRG8-GPU together with vendor-tuned random number generators for Intel KNL and GPU. MRG8-AVX512 achieves a substantial 69% improvement compared to Intel's MKL, and MRG8-GPU shows a maximum 3.36x speedup compared to NVIDIA's cuRAND library.
Citations: 0
Proceedings of the Platform for Advanced Scientific Computing Conference
Proceedings of the Platform for Advanced Scientific Computing Conference Pub Date : 2018-07-02 DOI: 10.1145/3218176
Citations: 2
Balanced Graph Partition Refinement using the Graph p-Laplacian
Proceedings of the Platform for Advanced Scientific Computing Conference Pub Date : 2018-07-02 DOI: 10.1145/3218176.3218232
Toby Simpson, D. Pasadakis, D. Kourounis, K. Fujita, Takuma Yamaguchi, T. Ichimura, O. Schenk
Abstract: A continuous formulation of optimal 2-way graph partitioning based on the p-norm minimization of the graph Laplacian Rayleigh quotient is presented. It provides a sharp approximation to the balanced graph partitioning problem, for which finding the optimal solution is known to be NP-hard. The minimization is initialized from a cut provided by a state-of-the-art multilevel recursive bisection algorithm, and then a continuation approach reduces the p-norm from a 2-norm towards a 1-norm, employing for each value of p a feasibility-preserving steepest-descent method that converges on the p-Laplacian eigenvector. A filter favors iterates advancing towards minimum edge cut and partition load imbalance. The complexity of the suggested approach is linear in the number of graph edges. The simplicity of the steepest-descent algorithm renders the overall approach highly scalable and efficient on parallel distributed architectures. Parallel implementations of recursive bisection on multi-core CPUs and GPUs are presented for large-scale graphs with up to 1.9 billion tetrahedra. The suggested approach exhibits improvements of up to 52.8% over METIS for graphs originating from triangular Delaunay meshes, 34.7% over METIS and 21.9% over KaHIP for power network graphs, 40.8% over METIS and 20.6% over KaHIP for sparse matrix graphs, and finally 93.2% over METIS for graphs emerging from social networks.
Citations: 6
The CLAW DSL: Abstractions for Performance Portable Weather and Climate Models
Proceedings of the Platform for Advanced Scientific Computing Conference Pub Date : 2018-07-02 DOI: 10.1145/3218176.3218226
Valentin Clement, S. Ferrachat, O. Fuhrer, X. Lapillonne, C. Osuna, R. Pincus, Jonathan S. Rood, W. Sawyer
Abstract: In order to profit from emerging high-performance computing systems, weather and climate models need to be adapted to run efficiently on different hardware architectures such as accelerators. This is a major challenge for existing community models, which are very large code bases written in Fortran. We introduce the CLAW domain-specific language (CLAW DSL) and the CLAW Compiler, which allow a single code written in Fortran to be retained while achieving a high degree of performance portability. Specifically, we present the Single Column Abstraction (SCA) of the CLAW DSL, which is targeted at the column-based algorithmic motifs typically encountered in the physical parameterizations of weather and climate models. Starting from a serial and non-optimized source code, the CLAW Compiler applies transformations and optimizations for a specific target hardware architecture and generates parallel optimized Fortran code annotated with OpenMP or OpenACC directives. Results from a state-of-the-art radiative transfer code indicate that, using CLAW, the amount of source code can be significantly reduced while achieving efficient code for x86 multi-core CPUs and GPU accelerators. The CLAW DSL is a significant step towards performance portable climate and weather models and could be adopted incrementally in existing code with limited effort.
Citations: 30
Extreme Computing for Extreme Adaptive Optics: The Key to Finding Life Outside our Solar System
Proceedings of the Platform for Advanced Scientific Computing Conference Pub Date : 2018-07-02 DOI: 10.1145/3218176.3218225
H. Ltaief, D. Sukkari, O. Guyon, D. Keyes
Abstract: The real-time correction of telescopic images in the search for exoplanets is highly sensitive to atmospheric aberrations. The pseudo-inverse algorithm is an efficient mathematical method to filter out these turbulences. We introduce a new partial singular value decomposition (SVD) algorithm based on QR-based Diagonally Weighted Halley (QDWH) iteration for the pseudo-inverse method of adaptive optics. The QDWH partial SVD algorithm selectively calculates the most significant singular values and their corresponding singular vectors. We develop a high-performance implementation and demonstrate the numerical robustness of the QDWH-based partial SVD method. We also perform a benchmarking campaign on various generations of GPU hardware accelerators and compare against the state-of-the-art SVD implementation SGESDD from the MAGMA library. Numerical accuracy and performance results are reported using synthetic and real observational datasets from the Subaru telescope. Our implementation outperforms SGESDD by up to fivefold and fourfold on ill-conditioned synthetic matrices and real observational datasets, respectively. The pseudo-inverse simulation code will be deployed on-sky for the Subaru telescope during observation nights scheduled for early 2018.
Citations: 7
Asynchronous Task-Based Parallelization of Algebraic Multigrid
Proceedings of the Platform for Advanced Scientific Computing Conference Pub Date : 2017-06-26 DOI: 10.1145/3093172.3093230
Amani Alonazi, George S. Markomanolis, D. Keyes
Abstract: As processor clock rates become more dynamic and workloads become more adaptive, the vulnerability to global synchronization that already complicates programming for performance in today's petascale environment will be exacerbated. Algebraic multigrid (AMG), the solver of choice in many large-scale PDE-based simulations, scales well in the weak sense, with fixed problem size per node, on tightly coupled systems when loads are well balanced and core performance is reliable. However, its strong scaling to many cores within a node is challenging. Reducing synchronization and increasing concurrency are vital adaptations of AMG to hybrid architectures. Recent communication-reducing improvements to classical additive AMG by Vassilevski and Yang improve concurrency and increase communication-computation overlap, while retaining convergence properties close to those of standard multiplicative AMG, but remain bulk synchronous. We extend the Vassilevski and Yang additive AMG to asynchronous task-based parallelism using a hybrid MPI+OmpSs (from the Barcelona Supercomputing Center) within a node, along with MPI for internode communications. We implement a tiling approach to decompose the grid hierarchy into parallel units within task containers. We compare against the MPI-only BoomerAMG and the Auxiliary-space Maxwell Solver (AMS) in the hypre library for the 3D Laplacian operator and electromagnetic diffusion, respectively. In time to solution for a full solve, an MPI-OmpSs hybrid improves over an all-MPI approach in strong scaling at full core count (32 threads per single Haswell node of the Cray XC40) and maintains this per-node advantage as both approaches weak-scale to thousands of cores, with MPI between nodes.
Citations: 7