Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region Workshops: Latest Articles

An Overview on Mixing MPI and OpenMP Dependent Tasking on A64FX
Romain Pereira, A. Roussel, Miwako Tsuji, Patrick Carribault, Mitsuhisa Sato, Hitoshi Murai, Thierry Gautier
{"title":"An Overview on Mixing MPI and OpenMP Dependent Tasking on A64FX","authors":"Romain Pereira, A. Roussel, Miwako Tsuji, Patrick Carribault, Mitsuhisa Sato, Hitoshi Murai, Thierry Gautier","doi":"10.1145/3636480.3637094","DOIUrl":"https://doi.org/10.1145/3636480.3637094","url":null,"abstract":"The adoption of ARM processor architectures is on the rise in the HPC ecosystem. The Fugaku supercomputer is a homogeneous ARM-based machine and is one of the most powerful machines in the world. In the programming world, dependent task-based programming models are gaining traction due to their many advantages: dynamic load balancing, implicit expression of communication/computation overlap, early-bird communication posting, and so on. MPI and OpenMP are two widespread programming standards that make task-based programming possible at a distributed-memory level. Despite its many advantages, mixed use of the standard programming models with dependent tasks is still under-evaluated on large-scale machines. In this paper, we provide an overview of mixing the OpenMP dependent tasking model with MPI using a state-of-the-art software stack (GCC-13, Clang17, MPC-OMP). We report the level of performance to expect when porting applications to such mixed use of the standards on the Fugaku supercomputer, using two benchmarks (Cholesky, HPCCG) and a proxy application (LULESH). We show that the software stack, resource binding, and communication progression mechanisms are factors that have a significant impact on performance. On distributed applications, performance reaches up to 80% efficiency for task-based applications like HPCCG. We also point out a few areas of improvement in OpenMP runtimes.","PeriodicalId":120904,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region Workshops","volume":"4 2","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-01-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139438064","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Parallel Multi-Physics Coupled Simulation of a Midrex Blast Furnace
Xavier Besseron, P. Adhav, Bernhard Peters
{"title":"Parallel Multi-Physics Coupled Simulation of a Midrex Blast Furnace","authors":"Xavier Besseron, P. Adhav, Bernhard Peters","doi":"10.1145/3636480.3636484","DOIUrl":"https://doi.org/10.1145/3636480.3636484","url":null,"abstract":"Traditional steelmaking is a major source of carbon dioxide emissions, but green steel production offers a sustainable alternative. Green steel is produced using hydrogen as a reducing agent instead of carbon monoxide, which results in only water vapour as a by-product. Midrex is a well-established technology that plays a crucial role in the green steel supply chain by producing direct reduced iron (DRI), a more environmentally friendly alternative to traditional iron production methods. In this work, we model a Midrex blast furnace and propose a parallel multi-physics simulation tool based on the coupling between Discrete Element Method (DEM) and Computational Fluid Dynamics (CFD). The particulate phase is simulated with XDEM (parallelized with MPI+OpenMP), the fluid phase is solved by OpenFOAM (parallelized with MPI), and the two solvers are coupled together using the preCICE library. We perform a careful performance analysis that focuses first on each solver individually and then on the coupled application. Our results highlight the difficulty of distributing the computing resources appropriately between the solvers in order to achieve the best performance. Finally, our multi-physics coupled implementation runs in parallel on 1024 cores and can simulate 500 seconds of the Midrex blast furnace in 1 hour and 45 minutes. This work identifies the challenge related to the load balancing of coupled solvers and makes a step forward towards the simulation of a complete 3D blast furnace on High-Performance Computing platforms.","PeriodicalId":120904,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region Workshops","volume":"25 6","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-01-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139438656","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
First Impressions of the NVIDIA Grace CPU Superchip and NVIDIA Grace Hopper Superchip for Scientific Workloads
N. Simakov, Matthew D. Jones, T. Furlani, E. Siegmann, Robert Harrison
{"title":"First Impressions of the NVIDIA Grace CPU Superchip and NVIDIA Grace Hopper Superchip for Scientific Workloads","authors":"N. Simakov, Matthew D. Jones, T. Furlani, E. Siegmann, Robert Harrison","doi":"10.1145/3636480.3637097","DOIUrl":"https://doi.org/10.1145/3636480.3637097","url":null,"abstract":"Engineering samples of the NVIDIA Grace CPU Superchip and NVIDIA Grace Hopper Superchip were tested using different benchmarks and scientific applications. The benchmarks include HPCC and HPCG. The real application-based benchmarks include AI-Benchmark-Alpha (a TensorFlow benchmark), Gromacs, OpenFOAM, and ROMS. The performance was compared to multiple Intel, AMD, and ARM CPUs and several x86 systems with NVIDIA GPUs. A brief energy efficiency estimate was performed based on TDP values. We found that in HPCC benchmark tests, the per-core performance of Grace is similar to or faster than AMD Milan cores, and the high core count often allows NVIDIA Grace CPU Superchip to have per-node performance similar to Intel Sapphire Rapids with High Bandwidth Memory: slower in matrix multiplication (by 17%) and FFT (by 6%), faster in Linpack (by 9%). In scientific applications, the NVIDIA Grace CPU Superchip performance is slower by 6% to 18% in Gromacs, faster by 7% in OpenFOAM, and right between HBM and DDR modes of Intel Sapphire Rapids in ROMS. The combined CPU-GPU performance in Gromacs is significantly faster (by 20% to 117%) than any tested x86-NVIDIA GPU system. Overall, the new NVIDIA Grace Hopper Superchip and NVIDIA Grace CPU Superchip are high-performance and most likely energy-efficient solutions for HPC centers.","PeriodicalId":120904,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region Workshops","volume":"13 5","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-01-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139438691","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Optimize Efficiency of Utilizing Systems by Dynamic Core Binding
Masatoshi Kawai, Akihiro Ida, Toshihiro Hanawa, Tetsuya Hoshino
{"title":"Optimize Efficiency of Utilizing Systems by Dynamic Core Binding","authors":"Masatoshi Kawai, Akihiro Ida, Toshihiro Hanawa, Tetsuya Hoshino","doi":"10.1145/3636480.3637221","DOIUrl":"https://doi.org/10.1145/3636480.3637221","url":null,"abstract":"Load balancing at both the process and thread levels is imperative for minimizing application computation time in the context of MPI/OpenMP hybrid parallelization. This necessity arises from the constraint that, within a typical hybrid parallel environment, an identical number of cores is bound to each process. Dynamic Core Binding, however, adjusts the core binding based on the process’s workload, thereby realizing load balancing at the core level. In prior research, we have implemented the DCB library, which has two policies for computation time reduction or power reduction. In this paper, we show that the two policies provided by the DCB library can be used together to achieve both computation time reduction and power consumption reduction.","PeriodicalId":120904,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region Workshops","volume":"8 13","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-01-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139438113","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Introducing software pipelining for the A64FX processor into LLVM
Masaki Arai, Naoto Fukumoto, Hitoshi Murai
{"title":"Introducing software pipelining for the A64FX processor into LLVM","authors":"Masaki Arai, Naoto Fukumoto, Hitoshi Murai","doi":"10.1145/3636480.3637093","DOIUrl":"https://doi.org/10.1145/3636480.3637093","url":null,"abstract":"Software pipelining is an essential optimization for accelerating High-Performance Computing (HPC) applications on CPUs. Modern CPUs achieve high performance through many-core and wide SIMD instructions. Software pipelining is an optimization that promotes further performance improvement of HPC applications by cooperating with these functions. Although open source compilers such as GCC and LLVM have implemented software pipelining, it is underutilized for the AArch64 architecture. We have implemented software pipelining for the A64FX processor on LLVM to improve this situation. This paper describes the details of this implementation. We also confirmed that our implementation improves the performance of several benchmark programs.","PeriodicalId":120904,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region Workshops","volume":"9 35","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-01-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139437643","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
High-throughput drug discovery on the Fujitsu A64FX architecture
Filippo Barbari, F. Ficarelli, Daniele Cesarini
{"title":"High-throughput drug discovery on the Fujitsu A64FX architecture","authors":"Filippo Barbari, F. Ficarelli, Daniele Cesarini","doi":"10.1145/3636480.3637095","DOIUrl":"https://doi.org/10.1145/3636480.3637095","url":null,"abstract":"High-performance computational kernels that optimally exploit modern vector-capable processors are critical in running large-scale drug discovery campaigns efficiently and promptly compatible with the constraints posed by urgent computing needs. Yet, state-of-the-art virtual screening workflows focus either on the broadness of features provided to the drug researcher or performance on high-throughput accelerators, leaving the task of deploying efficient CPU kernels to the compiler. We ported the key parts of the LiGen drug discovery pipeline, based on molecular docking, to the Fujitsu A64FX platform and leveraged its vector processing capabilities via an industry-proven retargetable SIMD programming model. By rethinking and optimizing key geometrical docking algorithms to leverage SVE instructions, we are able to provide efficient, high throughput execution on SVE-capable platforms.","PeriodicalId":120904,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region Workshops","volume":"4 11","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-01-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139438502","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
The Implementation of Gas-liquid Two-phase Flow Simulations with Surfactant Transport Based on GPU Computing and Adaptive Mesh Refinement
Tongda Lian, Shintaro Matsushita, Takayuki Aoki
{"title":"The Implementation of Gas-liquid Two-phase Flow Simulations with Surfactant Transport Based on GPU Computing and Adaptive Mesh Refinement","authors":"Tongda Lian, Shintaro Matsushita, Takayuki Aoki","doi":"10.1145/3636480.3636485","DOIUrl":"https://doi.org/10.1145/3636480.3636485","url":null,"abstract":"We proposed an implementation for surfactant transport simulations in gas-liquid two-phase flows. This implementation employs a tree-based interface-adapted adaptive mesh refinement (AMR) method, assigning a high-resolution mesh around the interface region, significantly reducing computational resources, such as memory and execution time. We developed GPU code in the CUDA programming language for the AMR method to further enhance performance through GPU parallel computing. The piece-wise linear interface calculation (PLIC) method, an interface-capturing approach for two-phase flows, is implemented based on the tree-based AMR method and GPU computing. We adopted the height function (HF) method to calculate interface curvature for surface tension assessment to suppress the spurious currents, and implemented it on the AMR mesh as well. We incorporated the Langmuir model to describe surfactant transport, as well as surfactant adsorption and desorption at the gas-liquid interface. Our implementation was applied to simulate a two-dimensional process where a bubble freely rises to the liquid surface, forms a thin liquid film, and eventually results in the film’s rupture. This simulation confirmed a reduction in the number of mesh grids required with our proposed implementations.","PeriodicalId":120904,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region Workshops","volume":"27 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-01-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139438650","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Performance Evaluation of the Fourth-Generation Xeon with Different Memory Characteristics
Keiichiro Fukazawa, Riki Takahashi
{"title":"Performance Evaluation of the Fourth-Generation Xeon with Different Memory Characteristics","authors":"Keiichiro Fukazawa, Riki Takahashi","doi":"10.1145/3636480.3637218","DOIUrl":"https://doi.org/10.1145/3636480.3637218","url":null,"abstract":"At the Supercomputer System of Academic Center for Computing and Media Studies Kyoto University, the fourth-generation Xeon (code-named Sapphire Rapids) is employed. The system consists of two subsystems—one equipped solely with high-bandwidth memory, HBM2e, and the other with a large DDR5 memory capacity. Using benchmark applications, a performance evaluation of systems with each type of memory was conducted. Additionally, the study employed a real application, the electromagnetic fluid code, to investigate how application performance varies based on differences in memory characteristics. The results confirm the performance improvement due to the high bandwidth of HBM2e. However, it was also observed that the efficiency is lower when using HBM2e, and the effects of cache memory optimization are relatively minimal.","PeriodicalId":120904,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region Workshops","volume":"10 6","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-01-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139438051","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Impact of Write-Allocate Elimination on Fujitsu A64FX
Yan Kang, Sayan Ghosh, M. Kandemir, Andrés Marquez
{"title":"Impact of Write-Allocate Elimination on Fujitsu A64FX","authors":"Yan Kang, Sayan Ghosh, M. Kandemir, Andrés Marquez","doi":"10.1145/3636480.3637283","DOIUrl":"https://doi.org/10.1145/3636480.3637283","url":null,"abstract":"ARM-based CPU architectures are currently driving massive disruptions in the High Performance Computing (HPC) community. Deployment of the 48-core Fujitsu A64FX ARM architecture based processor in the RIKEN “Fugaku” supercomputer (#2 in the June 2023 Top500 list) was a major inflection point in pushing ARM to mainstream HPC. A key design criterion of Fujitsu A64FX is to enhance the throughput of modern memory-bound applications, which happens to be a dominant pattern in contemporary HPC, as opposed to traditional compute-bound or floating-point intensive science workloads. One of the mechanisms to enhance the throughput concerns write-allocate operations (e.g., streaming write operations), which are quite common in science applications. In particular, eliminating write-allocate operations (allocate cache line on a write miss) through a special “zero fill” instruction available on the ARM CPU architectures can improve the overall memory bandwidth, by avoiding the memory read into a cache line, which is unnecessary since the cache line will be written subsequently. While bandwidth implications are relatively straightforward to measure via synthetic benchmarks with fixed-stride memory accesses, it is important to consider irregular memory-access driven scenarios such as graph analytics, and analyze the impact of write-allocate elimination on diverse data-driven applications. In this paper, we examine the impact of “zero fill” on OpenMP-based multithreaded graph application scenarios (Graph500 Breadth First Search, GAP benchmark suite, and Louvain Graph Clustering) and five application proxies from the Rodinia heterogeneous benchmark suite (molecular dynamics, sequence alignment, image processing, etc.), using LLVM-based ARM and GNU compilers on the Fujitsu FX700 A64FX platform of the Ookami system from Stony Brook University. Our results indicate that facilitating “zero fill” through code modifications to certain critical kernels or code segments that exhibit temporal write patterns can positively impact the overall performance of a variety of applications. We observe performance variations across the compilers and input data, and note end-to-end improvements of 5–20% for the benchmarks and a diverse spectrum of application scenarios owing to “zero fill” related adaptations.","PeriodicalId":120904,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region Workshops","volume":"5 12","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-01-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139438475","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
HPCnix: make HPC Apps more easier like shell script
Minoru Kanatsu, Hiroshi Yamada
{"title":"HPCnix: make HPC Apps more easier like shell script","authors":"Minoru Kanatsu, Hiroshi Yamada","doi":"10.1145/3636480.3637231","DOIUrl":"https://doi.org/10.1145/3636480.3637231","url":null,"abstract":"In the area of high-performance computing (HPC), it is expected to extract extreme computing performance using a highly optimized framework without even common OS APIs and frameworks for personal desktops. However, this makes the development cost higher than normal application development and challenging for beginners. The demand for large-scale computation is increasing due to the growth of cloud computing environments and the AI boom resulting from deep learning and large-scale language models. Therefore, a framework that makes it easier to handle HPC application programming is needed. This study shows a concept model that makes it possible to write HPC applications using semantics like the shell command pipeline in Unix. It proposes a simple application framework for beginners in HPC called HPCnix.","PeriodicalId":120904,"journal":{"name":"Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region Workshops","volume":"1 11","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-01-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139438395","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0