{"title":"An Overview on Mixing MPI and OpenMP Dependent Tasking on A64FX","authors":"Romain Pereira, A. Roussel, Miwako Tsuji, Patrick Carribault, Mitsuhisa Sato, Hitoshi Murai, Thierry Gautier","doi":"10.1145/3636480.3637094","DOIUrl":"https://doi.org/10.1145/3636480.3637094","url":null,"abstract":"The adoption of ARM processor architectures is on the rise in the HPC ecosystem. The Fugaku supercomputer is a homogeneous ARM-based machine and one of the most powerful machines in the world. In the programming world, dependent task-based programming models are gaining traction due to their many advantages: dynamic load balancing, implicit expression of communication/computation overlap, early-bird communication posting, etc. MPI and OpenMP are two widespread programming standards that enable task-based programming at the distributed-memory level. Despite its many advantages, mixed use of these standard programming models with dependent tasks remains under-evaluated on large-scale machines. In this paper, we provide an overview of mixing the OpenMP dependent tasking model with MPI using the state-of-the-art software stack (GCC 13, Clang 17, MPC-OMP). We report the level of performance to expect when porting applications to such mixed use of the standards on the Fugaku supercomputer, using two benchmarks (Cholesky, HPCCG) and a proxy application (LULESH). We show that the software stack, resource binding, and communication progression mechanisms all have a significant impact on performance. On distributed applications, performance reaches up to 80% efficiency for task-based applications like HPCCG. 
We also point out a few areas of improvement in OpenMP runtimes.","PeriodicalId":120904,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region Workshops","volume":"4 2","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-01-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139438064","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Parallel Multi-Physics Coupled Simulation of a Midrex Blast Furnace","authors":"Xavier Besseron, P. Adhav, Bernhard Peters","doi":"10.1145/3636480.3636484","DOIUrl":"https://doi.org/10.1145/3636480.3636484","url":null,"abstract":"Traditional steelmaking is a major source of carbon dioxide emissions, but green steel production offers a sustainable alternative. Green steel is produced using hydrogen as a reducing agent instead of carbon monoxide, which results in only water vapour as a by-product. Midrex is a well-established technology that plays a crucial role in the green steel supply chain by producing direct reduced iron (DRI), a more environmentally friendly alternative to traditional iron production methods. In this work, we model a Midrex blast furnace and propose a parallel multi-physics simulation tool based on the coupling between Discrete Element Method (DEM) and Computational Fluid Dynamics (CFD). The particulate phase is simulated with XDEM (parallelized with MPI+OpenMP), the fluid phase is solved by OpenFOAM (parallelized with MPI), and the two solvers are coupled together using the preCICE library. We perform a careful performance analysis that focuses first on each solver individually and then on the coupled application. Our results highlight the difficulty of distributing the computing resources appropriately between the solvers in order to achieve the best performance. Finally, our multi-physics coupled implementation runs in parallel on 1024 cores and can simulate 500 seconds of the Midrex blast furnace in 1 hour and 45 minutes. 
This work identifies the challenges related to load balancing of coupled solvers and takes a step toward the simulation of a complete 3D blast furnace on High-Performance Computing platforms.","PeriodicalId":120904,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region Workshops","volume":"25 6","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-01-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139438656","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"First Impressions of the NVIDIA Grace CPU Superchip and NVIDIA Grace Hopper Superchip for Scientific Workloads","authors":"N. Simakov, Matthew D. Jones, T. Furlani, E. Siegmann, Robert Harrison","doi":"10.1145/3636480.3637097","DOIUrl":"https://doi.org/10.1145/3636480.3637097","url":null,"abstract":"Engineering samples of the NVIDIA Grace CPU Superchip and NVIDIA Grace Hopper Superchip were tested using different benchmarks and scientific applications. The benchmarks include HPCC and HPCG. The application-based benchmarks include AI-Benchmark-Alpha (a TensorFlow benchmark), Gromacs, OpenFOAM, and ROMS. Performance was compared to multiple Intel, AMD, and ARM CPUs and several x86 systems with NVIDIA GPUs. A brief energy-efficiency estimate was performed based on TDP values. We found that in the HPCC benchmark tests, the per-core performance of Grace is similar to or faster than that of AMD Milan cores, and the high core count often allows the NVIDIA Grace CPU Superchip to reach per-node performance similar to Intel Sapphire Rapids with High Bandwidth Memory: slower in matrix multiplication (by 17%) and FFT (by 6%), faster in Linpack (by 9%). In scientific applications, the NVIDIA Grace CPU Superchip is slower by 6% to 18% in Gromacs, faster by 7% in OpenFOAM, and right between the HBM and DDR modes of Intel Sapphire Rapids in ROMS. The combined CPU-GPU performance in Gromacs is significantly faster (by 20% to 117%) than that of any tested x86-NVIDIA GPU system. 
Overall, the new NVIDIA Grace Hopper Superchip and NVIDIA Grace CPU Superchip are high-performance and most likely energy-efficient solutions for HPC centers.","PeriodicalId":120904,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region Workshops","volume":"13 5","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-01-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139438691","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Optimize Efficiency of Utilizing Systems by Dynamic Core Binding","authors":"Masatoshi Kawai, Akihiro Ida, Toshihiro Hanawa, Tetsuya Hoshino","doi":"10.1145/3636480.3637221","DOIUrl":"https://doi.org/10.1145/3636480.3637221","url":null,"abstract":"Load balancing at both the process and thread levels is imperative for minimizing application computation time in the context of MPI/OpenMP hybrid parallelization. This necessity arises from the constraint that, within a typical hybrid parallel environment, an identical number of cores is bound to each process. Dynamic Core Binding (DCB), however, adjusts the core binding based on each process's workload, thereby realizing load balancing at the core level. In prior research, we implemented the DCB library, which offers two policies: computation-time reduction and power reduction. In this paper, we show that the two policies provided by the DCB library can be combined to achieve both computation-time reduction and power-consumption reduction.","PeriodicalId":120904,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region Workshops","volume":"8 13","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-01-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139438113","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Introducing software pipelining for the A64FX processor into LLVM","authors":"Masaki Arai, Naoto Fukumoto, Hitoshi Murai","doi":"10.1145/3636480.3637093","DOIUrl":"https://doi.org/10.1145/3636480.3637093","url":null,"abstract":"Software pipelining is an essential optimization for accelerating High-Performance Computing (HPC) applications on CPUs. Modern CPUs achieve high performance through many cores and wide SIMD instructions. Software pipelining is an optimization that promotes further performance improvement of HPC applications in cooperation with these features. Although open-source compilers such as GCC and LLVM implement software pipelining, it is underutilized for the AArch64 architecture. To improve this situation, we have implemented software pipelining for the A64FX processor in LLVM. This paper describes the details of this implementation. We also confirmed that our implementation improves the performance of several benchmark programs.","PeriodicalId":120904,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region Workshops","volume":"9 35","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-01-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139437643","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"High-throughput drug discovery on the Fujitsu A64FX architecture","authors":"Filippo Barbari, F. Ficarelli, Daniele Cesarini","doi":"10.1145/3636480.3637095","DOIUrl":"https://doi.org/10.1145/3636480.3637095","url":null,"abstract":"High-performance computational kernels that optimally exploit modern vector-capable processors are critical for running large-scale drug discovery campaigns efficiently and promptly, compatible with the constraints posed by urgent computing needs. Yet state-of-the-art virtual screening workflows focus either on the breadth of features provided to the drug researcher or on performance on high-throughput accelerators, leaving the task of deploying efficient CPU kernels to the compiler. We ported the key parts of the LiGen drug discovery pipeline, based on molecular docking, to the Fujitsu A64FX platform and leveraged its vector processing capabilities via an industry-proven retargetable SIMD programming model. By rethinking and optimizing key geometric docking algorithms to leverage SVE instructions, we provide efficient, high-throughput execution on SVE-capable platforms.","PeriodicalId":120904,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region Workshops","volume":"4 11","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-01-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139438502","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The Implementation of Gas-liquid Two-phase Flow Simulations with Surfactant Transport Based on GPU Computing and Adaptive Mesh Refinement","authors":"Tongda Lian, Shintaro Matsushita, Takayuki Aoki","doi":"10.1145/3636480.3636485","DOIUrl":"https://doi.org/10.1145/3636480.3636485","url":null,"abstract":"We propose an implementation for surfactant transport simulations in gas-liquid two-phase flows. This implementation employs a tree-based, interface-adapted adaptive mesh refinement (AMR) method, assigning a high-resolution mesh around the interface region and significantly reducing computational resources such as memory and execution time. We developed GPU code for the AMR method using the CUDA programming language to further enhance performance through GPU parallel computing. The piece-wise linear interface calculation (PLIC) method, an interface-capturing approach for two-phase flows, is implemented on top of the tree-based AMR method and GPU computing. We adopted the height function (HF) method to calculate interface curvature for surface tension evaluation, suppressing spurious currents, and implemented it on the AMR mesh as well. We incorporated the Langmuir model to describe surfactant transport, as well as surfactant adsorption and desorption at the gas-liquid interface. Our implementation was applied to simulate a two-dimensional process in which a bubble freely rises to the liquid surface, forms a thin liquid film, and eventually results in the film’s rupture. 
This simulation confirmed a reduction in the number of mesh grids required with our proposed implementations.","PeriodicalId":120904,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region Workshops","volume":"27 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-01-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139438650","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Performance Evaluation of the Fourth-Generation Xeon with Different Memory Characteristics","authors":"Keiichiro Fukazawa, Riki Takahashi","doi":"10.1145/3636480.3637218","DOIUrl":"https://doi.org/10.1145/3636480.3637218","url":null,"abstract":"At the Supercomputer System of Academic Center for Computing and Media Studies Kyoto University, the fourth-generation Xeon (code-named Sapphire Rapids) is employed. The system consists of two subsystems—one equipped solely with high-bandwidth memory, HBM2e, and the other with a large DDR5 memory capacity. Using benchmark applications, a performance evaluation of systems with each type of memory was conducted. Additionally, the study employed a real application, the electromagnetic fluid code, to investigate how application performance varies based on differences in memory characteristics. The results confirm the performance improvement due to the high bandwidth of HBM2e. However, it was also observed that the efficiency is lower when using HBM2e, and the effects of cache memory optimization are relatively minimal.","PeriodicalId":120904,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region Workshops","volume":"10 6","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-01-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139438051","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Impact of Write-Allocate Elimination on Fujitsu A64FX","authors":"Yan Kang, Sayan Ghosh, M. Kandemir, Andrés Marquez","doi":"10.1145/3636480.3637283","DOIUrl":"https://doi.org/10.1145/3636480.3637283","url":null,"abstract":"ARM-based CPU architectures are currently driving massive disruptions in the High Performance Computing (HPC) community. Deployment of the 48-core Fujitsu A64FX ARM-architecture-based processor in the RIKEN “Fugaku” supercomputer (#2 in the June 2023 Top500 list) was a major inflection point in pushing ARM into mainstream HPC. A key design criterion of the Fujitsu A64FX is to enhance the throughput of modern memory-bound applications, which happen to be a dominant pattern in contemporary HPC, as opposed to traditional compute-bound or floating-point-intensive science workloads. One of the mechanisms to enhance throughput concerns write-allocate operations (e.g., streaming write operations), which are quite common in science applications. In particular, eliminating write-allocate operations (allocating a cache line on a write miss) through a special “zero fill” instruction available on ARM CPU architectures can improve overall memory bandwidth by avoiding the memory read into a cache line, which is unnecessary since the cache line will subsequently be overwritten. While the bandwidth implications are relatively straightforward to measure via synthetic benchmarks with fixed-stride memory accesses, it is important to consider irregular memory-access-driven scenarios such as graph analytics and to analyze the impact of write-allocate elimination on diverse data-driven applications. 
In this paper, we examine the impact of “zero fill” on OpenMP-based multithreaded graph application scenarios (Graph500 Breadth First Search, GAP benchmark suite, and Louvain Graph Clustering) and five application proxies from the Rodinia heterogeneous benchmark suite (molecular dynamics, sequence alignment, image processing, etc.), using LLVM-based ARM and GNU compilers on the Fujitsu FX700 A64FX platform of the Ookami system from Stony Brook University. Our results indicate that facilitating “zero fill” through code modifications to certain critical kernels or code segments that exhibit temporal write patterns can positively impact the overall performance of a variety of applications. We observe performance variations across the compilers and input data, and note end-to-end improvements between 5–20% for the benchmarks and diverse spectrum of application scenarios owing to “zero fill” related adaptations.","PeriodicalId":120904,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region Workshops","volume":"5 12","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-01-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139438475","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"HPCnix: make HPC Apps more easier like shell script","authors":"Minoru Kanatsu, Hiroshi Yamada","doi":"10.1145/3636480.3637231","DOIUrl":"https://doi.org/10.1145/3636480.3637231","url":null,"abstract":"In the area of high-performance computing (HPC), applications are expected to extract extreme computing performance using highly optimized frameworks, often without the common OS APIs and frameworks found on personal desktops. However, this makes development costlier than normal application development and challenging for beginners. The demand for large-scale computation is increasing due to the growth of cloud computing environments and the AI boom driven by deep learning and large language models. Therefore, a framework that makes HPC application programming easier to handle is needed. This study presents a concept model that makes it possible to write HPC applications using semantics like the shell command pipeline in Unix, and proposes a simple application framework for HPC beginners called HPCnix.","PeriodicalId":120904,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region Workshops","volume":"1 11","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-01-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139438395","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}