Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region Workshops最新文献

An Overview on Mixing MPI and OpenMP Dependent Tasking on A64FX A64FX 上混合使用 MPI 和 OpenMP 任务分配概述

Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region Workshops Pub Date : 2024-01-11 DOI: 10.1145/3636480.3637094

Romain Pereira, A. Roussel, Miwako Tsuji, Patrick Carribault, Mitsuhisa Sato, Hitoshi Murai, Thierry Gautier

{"title":"An Overview on Mixing MPI and OpenMP Dependent Tasking on A64FX","authors":"Romain Pereira, A. Roussel, Miwako Tsuji, Patrick Carribault, Mitsuhisa Sato, Hitoshi Murai, Thierry Gautier","doi":"10.1145/3636480.3637094","DOIUrl":"https://doi.org/10.1145/3636480.3637094","url":null,"abstract":"The adoption of ARM processor architectures is on the rise in the HPC ecosystem. Fugaku supercomputer is a homogeneous ARM-based machine, and is one among the most powerful machine in the world. In the programming world, dependent task-based programming models are gaining tractions due to their many advantages: dynamic load balancing, implicit expression of communication/computation overlap, early-bird communication posting,...MPI and OpenMP are two widespreads programming standards that make possible task-based programming at a distributed memory level. Despite its many advantages, mixed-use of the standard programming models using dependent tasks is still under-evaluated on large-scale machines. In this paper, we provide an overview on mixing OpenMP dependent tasking model with MPI with the state-of-the-art software stack (GCC-13, Clang17, MPC-OMP). We provide the level of performances to expect by porting applications to such mixed-use of the standard on the Fugaku supercomputers, using two benchmarks (Cholesky, HPCCG) and a proxy-application (LULESH). We show that software stack, resource binding and communication progression mechanisms are factors that have a significant impact on performance. On distributed applications, performances reaches up to 80% of effiency for task-based applications like HPCCG. We also point-out a few areas of improvements in OpenMP runtimes.","PeriodicalId":120904,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region Workshops","volume":"4 2","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-01-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139438064","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Parallel Multi-Physics Coupled Simulation of a Midrex Blast Furnace Midrex 高炉的并行多物理场耦合模拟

Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region Workshops Pub Date : 2024-01-11 DOI: 10.1145/3636480.3636484

Xavier Besseron, P. Adhav, Bernhard Peters

{"title":"Parallel Multi-Physics Coupled Simulation of a Midrex Blast Furnace","authors":"Xavier Besseron, P. Adhav, Bernhard Peters","doi":"10.1145/3636480.3636484","DOIUrl":"https://doi.org/10.1145/3636480.3636484","url":null,"abstract":"Traditional steelmaking is a major source of carbon dioxide emissions, but green steel production offers a sustainable alternative. Green steel is produced using hydrogen as a reducing agent instead of carbon monoxide, which results in only water vapour as a by-product. Midrex is a well-established technology that plays a crucial role in the green steel supply chain by producing direct reduced iron (DRI), a more environmentally friendly alternative to traditional iron production methods. In this work, we model a Midrex blast furnace and propose a parallel multi-physics simulation tool based on the coupling between Discrete Element Method (DEM) and Computational Fluid Dynamics (CFD). The particulate phase is simulated with XDEM (parallelized with MPI+OpenMP), the fluid phase is solved by OpenFOAM (parallelized with MPI), and the two solvers are coupled together using the preCICE library. We perform a careful performance analysis that focuses first on each solver individually and then on the coupled application. Our results highlight the difficulty of distributing the computing resources appropriately between the solvers in order to achieve the best performance. Finally, our multi-physics coupled implementation runs in parallel on 1024 cores and can simulate 500 seconds of the Midrex blast furnace in 1 hour and 45 minutes. This work identifies the challenge related to the load balancing of coupled solvers and makes a step forward towards the simulation of a complete 3D blast furnace on High-Performance Computing platforms.","PeriodicalId":120904,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region Workshops","volume":"25 6","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-01-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139438656","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

First Impressions of the NVIDIA Grace CPU Superchip and NVIDIA Grace Hopper Superchip for Scientific Workloads 面向科学工作负载的英伟达™（NVIDIA®）Grace CPU 超级芯片和英伟达™（NVIDIA®）Grace Hopper 超级芯片初体验

Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region Workshops Pub Date : 2024-01-11 DOI: 10.1145/3636480.3637097

N. Simakov, Matthew D. Jones, T. Furlani, E. Siegmann, Robert Harrison

{"title":"First Impressions of the NVIDIA Grace CPU Superchip and NVIDIA Grace Hopper Superchip for Scientific Workloads","authors":"N. Simakov, Matthew D. Jones, T. Furlani, E. Siegmann, Robert Harrison","doi":"10.1145/3636480.3637097","DOIUrl":"https://doi.org/10.1145/3636480.3637097","url":null,"abstract":"The engineering samples of the NVIDIA Grace CPU Superchip and NVIDIA Grace Hopper Superchips were tested using different benchmarks and scientific applications. The benchmarks include HPCC and HPCG. The real application-based benchmark includes AI-Benchmark-Alpha (a TensorFlow benchmark), Gromacs, OpenFOAM, and ROMS. The performance was compared to multiple Intel, AMD, ARM CPUs and several x86 with NVIDIA GPU systems. A brief energy efficiency estimate was performed based on TDP values. We found that in HPCC benchmark tests, the per-core performance of Grace is similar to or faster than AMD Milan cores, and the high core count often allows NVIDIA Grace CPU Superchip to have per-node performance similar to Intel Sapphire Rapids with High Bandwidth Memory: slower in matrix multiplication (by 17%) and FFT (by 6%), faster in Linpack (by 9%)). In scientific applications, the NVIDIA Grace CPU Superchip performance is slower by 6% to 18% in Gromacs, faster by 7% in OpenFOAM, and right between HBM and DDR modes of Intel Sapphire Rapids in ROMS. The combined CPU-GPU performance in Gromacs is significantly faster (by 20% to 117% faster) than any tested x86-NVIDIA GPU system. Overall, the new NVIDIA Grace Hopper Superchip and NVIDIA Grace CPU Superchip Superchip are high-performance and most likely energy-efficient solutions for HPC centers.","PeriodicalId":120904,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region Workshops","volume":"13 5","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-01-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139438691","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Optimize Efficiency of Utilizing Systems by Dynamic Core Binding 通过动态核心绑定优化系统利用效率

Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region Workshops Pub Date : 2024-01-11 DOI: 10.1145/3636480.3637221

Masatoshi Kawai, Akihiro Ida, Toshihiro Hanawa, Tetsuya Hoshino

引用次数: 0

Introducing software pipelining for the A64FX processor into LLVM 将 A64FX 处理器的软件流水线引入 LLVM

Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region Workshops Pub Date : 2024-01-11 DOI: 10.1145/3636480.3637093

Masaki Arai, Naoto Fukumoto, Hitoshi Murai

引用次数: 0

High-throughput drug discovery on the Fujitsu A64FX architecture 富士通 A64FX 架构上的高通量药物发现

Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region Workshops Pub Date : 2024-01-11 DOI: 10.1145/3636480.3637095

Filippo Barbari, F. Ficarelli, Daniele Cesarini

引用次数: 0

The Implementation of Gas-liquid Two-phase Flow Simulations with Surfactant Transport Based on GPU Computing and Adaptive Mesh Refinement 基于 GPU 计算和自适应网格细化的带表面活性剂传输的气液两相流模拟的实现

Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region Workshops Pub Date : 2024-01-11 DOI: 10.1145/3636480.3636485

Tongda Lian, Shintaro Matsushita, Takayuki Aoki

{"title":"The Implementation of Gas-liquid Two-phase Flow Simulations with Surfactant Transport Based on GPU Computing and Adaptive Mesh Refinement","authors":"Tongda Lian, Shintaro Matsushita, Takayuki Aoki","doi":"10.1145/3636480.3636485","DOIUrl":"https://doi.org/10.1145/3636480.3636485","url":null,"abstract":"We proposed an implementation for surfactant transport simulations in gas-liquid two-phase flows. This implementation employs a tree-based interface-adapted adaptive mesh refinement (AMR) method, assigning a high-resolution mesh around the interface region, significantly reducing computational resources, such as memory and execution time. We developed GPU code by CUDA programming language for the AMR method to further enhance performance through GPU parallel computing. The piece-wise linear interface calculation (PLIC) method, an interface-capturing approach for two-phase flows, is implemented based on the tree-based AMR method and GPU computing. We adopted the height function (HF) method to calculate interface curvature for surface tension assessment to suppress the spurious currents, and implemented it on the AMR mesh as well. We incorporated the Langmuir model to describe surfactant transport, as well as surfactant adsorption and desorption at the gas-liquid interface. Our implementation was applied to simulate a two-dimensional process where a bubble freely rises to the liquid surface, forms a thin liquid film, and eventually results in the film’s rupture. This simulation confirmed a reduction in the number of mesh grids required with our proposed implementations.","PeriodicalId":120904,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region Workshops","volume":"27 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-01-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139438650","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Performance Evaluation of the Fourth-Generation Xeon with Different Memory Characteristics 采用不同内存特性的第四代至强处理器性能评估

Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region Workshops Pub Date : 2024-01-11 DOI: 10.1145/3636480.3637218

Keiichiro Fukazawa, Riki Takahashi

引用次数: 0

Impact of Write-Allocate Elimination on Fujitsu A64FX 取消写入分配对富士通 A64FX 的影响

Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region Workshops Pub Date : 2024-01-11 DOI: 10.1145/3636480.3637283

Yan Kang, Sayan Ghosh, M. Kandemir, Andrés Marquez

{"title":"Impact of Write-Allocate Elimination on Fujitsu A64FX","authors":"Yan Kang, Sayan Ghosh, M. Kandemir, Andrés Marquez","doi":"10.1145/3636480.3637283","DOIUrl":"https://doi.org/10.1145/3636480.3637283","url":null,"abstract":"ARM-based CPU architectures are currently driving massive disruptions in the High Performance Computing (HPC) community. Deployment of the 48-core Fujitsu A64FX ARM architecture based processor in RIKEN “Fugaku” supercomputer (#2 in the June 2023 Top500 list) was a major inflection point in pushing ARM to mainstream HPC. A key design criteria of Fujitsu A64FX is to enhance the throughput of modern memory-bound applications, which happens to be a dominant pattern in contemporary HPC, as opposed to traditional compute-bound or floating-point intensive science workloads. One of the mechanisms to enhance the throughput concerns write-allocate operations (e.g., streaming write operations), which are quite common in science applications. In particular, eliminating write-allocate operations (allocate cache line on a write miss) through a special “zero fill” instruction available on the ARM CPU architectures can improve the overall memory bandwidth, by avoiding the memory read into a cache line, which is unnecessary since the cache line will be written consequently. While bandwidth implications are relatively straightforward to measure via synthetic benchmarks with fixed-stride memory accesses, it is important to consider irregular memory-access driven scenarios such as graph analytics, and analyze the impact of write-allocate elimination on diverse data-driven applications. In this paper, we examine the impact of “zero fill” on OpenMP-based multithreaded graph application scenarios (Graph500 Breadth First Search, GAP benchmark suite, and Louvain Graph Clustering) and five application proxies from the Rodinia heterogeneous benchmark suite (molecular dynamics, sequence alignment, image processing, etc.), using LLVM-based ARM and GNU compilers on the Fujitsu FX700 A64FX platform of the Ookami system from Stony Brook University. Our results indicate that facilitating “zero fill” through code modifications to certain critical kernels or code segments that exhibit temporal write patterns can positively impact the overall performance of a variety of applications. We observe performance variations across the compilers and input data, and note end-to-end improvements between 5–20% for the benchmarks and diverse spectrum of application scenarios owing to “zero fill” related adaptations.","PeriodicalId":120904,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region Workshops","volume":"5 12","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-01-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139438475","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

HPCnix: make HPC Apps more easier like shell script HPCnix：让高性能计算应用程序像 shell 脚本一样更简单

Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region Workshops Pub Date : 2024-01-11 DOI: 10.1145/3636480.3637231

Minoru Kanatsu, Hiroshi Yamada

引用次数: 0